Data dependency detection bug with arm-none-eabi-gcc and -Ofast

Data dependency detection bug with arm-none-eabi-gcc and -Ofast - c

I am writing a program for an Arm Cortex M7 based microcontroller, where I am facing an issue with (I guess) compiler optimization.
I compile using arm-none-eabi-gcc v10.2.1 with -mcpu=cortex-m7 -std=gnu11 -Ofast -mthumb.
I extracted the relevant parts, they are as below. I copy some data from a global variable into a local one, and from there the data is written to a hardware
CRC engine to compute the CRC over the data. Before writing it to the engine, the byte order is to be reversed.
#include <stddef.h>
#include <stdint.h>
#define MEMORY_BARRIER asm volatile("" : : : "memory")
#define __REV(x) __builtin_bswap32(x) // Reverse byte order
// Dummy struct, similiar as it is used in our program
typedef struct {
float n1, n2, n3, dn1, dn2, dn3;
} test_struct_t;
// CRC Engine Registers
typedef struct {
volatile uint32_t DR; // CRC Data register Address offset: 0x00
} CRC_TypeDef;
#define CRC ((CRC_TypeDef*) (0x40000000UL))
// Write data to CRC engine
static inline void CRC_write(uint32_t* data, size_t size)
{
for(int index = 0; index < (size / 4); index++) {
CRC->DR = __REV(*(data + index));
}
}
// Global variable holding some data, similiar as it is used in our program
volatile test_struct_t global_var = { 0 };
int main()
{
test_struct_t local_tmp;
MEMORY_BARRIER;
// Copy the current data into a local variable
local_tmp = global_var;
// Now, write the data to the CRC engine to compute CRC over the data
CRC_write((uint32_t*) &local_tmp, sizeof(local_tmp));
MEMORY_BARRIER;
return 0;
}
This compiles to:
main:
push {r4, r5, lr}
sub sp, sp, #28
ldr r4, .L4
mov ip, sp
ldr r3, [sp] ; load first float into r3
mov lr, #1073741824 ; load address of CRC engine DR register into lr
rev r5, r3 ; reverse byte order of first float and store in r5
ldmia r4!, {r0, r1, r2, r3} ; load first 4 floats into r0-r3
stmia ip!, {r0, r1, r2, r3} ; copy first 4 floats into local variable
ldm r4, {r0, r1} ; load float 5 and 6 into r0 and r1
stm ip, {r0, r1} ; copy float 5 and 6 into local variable
str r5, [lr] ; write reversed bytes from r5 to CRC engine
ldr r3, [sp, #4]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #8]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #12]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #16]
rev r3, r3
str r3, [lr]
ldr r3, [sp, #20]
rev r3, r3
str r3, [lr]
movs r0, #0
add sp, sp, #28
pop {r4, r5, pc}
.L4:
.word .LANCHOR0
global_var:
The problem is now that the first float is loaded and its byte order is reversed (rev r5, r3) before the data has been copied from the global variable (happens with the ldmia/sdmia).
This is clearly not intended, leads to a wrong CRC, and I wonder why the compiler does not recognize this data dependency. Am I doing anything wrong or is there a bug in GCC?
Code in compiler explorer: https://godbolt.org/z/ErWThc5dx

Related

How do i know if the compiler will optimize a variable?

I am new to Microcontrollers. I have read a lot of Articles and documentations about volatile variables in c. What i understood, is that while using volatile we are telling the compiler not to cache either to optimize the variable. However i still didnt get when this should really be used.For example let's say i have a simple counter and for loop like this.
for(int i=0; i < blabla.length; i++) {
//code here
}
or maybe when i write a simple piece of code like this
int i=1;
int j=1;
printf("the sum is: %d\n" i+j);
I have never cared about compiler optimization for such examples. But in many scopes if the variable is not declared as volatile the ouptut wont be as expected. How would i know that i have to care about compiler optimization in other examples?

Simple example:
int flag = 1;
while (flag)
{
do something that doesn't involve flag
}
This can be optimized to:
while (true)
{
do something
}
because the compiler knows that flag never changes.
with this code:
volatile int flag = 1;
while (flag)
{
do something that doesn't involve flag
}
nothing will be optimized, because now the compiler knows: "although the program doesn't change flag inside the while loop, it might changed anyway".

According to cppreference:
volatile object - an object whose type is volatile-qualified, or a subobject of a volatile object, or a mutable subobject of a const-volatile object. Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to refer to a volatile object through a non-volatile glvalue (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.
This explains why some optimizations can’t be made by the compiler since it can’t entirely predict when its value will be modified at compile-time. This qualifier is useful to indicate to the compiler that it shouldn’t do these optimizations because its value can be changed in a way unknown by the compiler.
I have not worked recently with microcontrollers but I think that the states of different electrical input and output pins have to be marked as volatile since the compiler doesn’t know that they can be changed externally. (In this case by means other than code like when you plug-in a component).

Just try it. First off there is the language and what is possible to be optimized and then there is what the compiler actual figures out and optimizes, if it can be optimized does not mean the compiler will figure it out nor will it always produce the code you think.
Volatile has nothing to do with caching of any kind, did not we just get this question recently using that term? Volatile indicates to the compiler that the variable should not be optimized into a register or optimized away. Let us say "all" accesses to that variable must go back to memory, although different compilers have a different understanding of how to use volatile, I have seen clang (llvm) and gcc (gnu) disagree, when the variable was used twice in a row or something like that clang didnt do two reads it only did one.
It was a Stack Overflow question you are welcome to search for it, the clang code was slightly faster than gcc, simply because of one less instruction because of differences of opinion of how to implement volatile. So even there the main compiler folks can't agree on what it really means. Its the nature of the C language, lots of implementation defined features and pro tip, avoid them volatile, bitfields, unions, etc, certainly across compile domains.
void fun0 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
}
}
00000000 <fun0>:
0: 4770 bx lr
This is completely dead code, it does noting it touches nothing, all the items are local, so it can all go away, simply return.
unsigned int fun1 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
}
return i;
}
00000004 <fun1>:
4: 2005 movs r0, #5
6: 4770 bx lr
This one returns something, the compiler can figure out it is counting and the last value after the loop is what gets returned....so just return that value, no need for variables or any other code generation, the rest is dead code.
unsigned int fun2 ( unsigned int len )
{
unsigned int i;
for(i=0; i < len; i++)
{
}
return i;
}
00000008 <fun2>:
8: 4770 bx lr
Like fun1 except the value is passed in in a register, just happens to be the same register as the return value for the ABI for this target. So you do not even have to copy the length to the return value in this case, for other architectures or ABIs we would hope that this optimizes to return = len and that gets sent back. A simple mov instruction.
unsigned int fun3 ( unsigned int len )
{
volatile unsigned int i;
for(i=0; i < len; i++)
{
}
return i;
}
0000000c <fun3>:
c: 2300 movs r3, #0
e: b082 sub sp, #8
10: 9301 str r3, [sp, #4]
12: 9b01 ldr r3, [sp, #4]
14: 4298 cmp r0, r3
16: d905 bls.n 24 <fun3+0x18>
18: 9b01 ldr r3, [sp, #4]
1a: 3301 adds r3, #1
1c: 9301 str r3, [sp, #4]
1e: 9b01 ldr r3, [sp, #4]
20: 4283 cmp r3, r0
22: d3f9 bcc.n 18 <fun3+0xc>
24: 9801 ldr r0, [sp, #4]
26: b002 add sp, #8
28: 4770 bx lr
2a: 46c0 nop ; (mov r8, r8)
it gets significantly different here, that is a lot of code compared to the ones thus far. We would like to think that volatile indicates all uses of that variable touch the memory for that variable.
12: 9b01 ldr r3, [sp, #4]
14: 4298 cmp r0, r3
16: d905 bls.n 24 <fun3+0x18>
get i and compare it to len is it less than? we are done exit loop
18: 9b01 ldr r3, [sp, #4]
1a: 3301 adds r3, #1
1c: 9301 str r3, [sp, #4]
i was less than len so we need to increment it, read it, change it, write it back.
1e: 9b01 ldr r3, [sp, #4]
20: 4283 cmp r3, r0
22: d3f9 bcc.n 18 <fun3+0xc>
do the i < len test again, see if it is less than or greater than and loop again or do not.
24: 9801 ldr r0, [sp, #4]
get i from ram so it can be returned.
All reads and writes of i involved the memory that holds i. Because we asked for that now the loop is not dead code each iteration has to be implemented in order to handle all the touches of that variable on memory.
void fun4 ( void )
{
unsigned int a;
unsigned int b;
a = 1;
b = 1;
fun3(a+b);
}
0000002c <fun4>:
2c: 2300 movs r3, #0
2e: b082 sub sp, #8
30: 9301 str r3, [sp, #4]
32: 9b01 ldr r3, [sp, #4]
34: 2b01 cmp r3, #1
36: d805 bhi.n 44 <fun4+0x18>
38: 9b01 ldr r3, [sp, #4]
3a: 3301 adds r3, #1
3c: 9301 str r3, [sp, #4]
3e: 9b01 ldr r3, [sp, #4]
40: 2b01 cmp r3, #1
42: d9f9 bls.n 38 <fun4+0xc>
44: 9b01 ldr r3, [sp, #4]
46: b002 add sp, #8
48: 4770 bx lr
4a: 46c0 nop ; (mov r8, r8)
this both optimized out the addition and the a and b variables but also optimized by inlining the fun3 function.
void fun5 ( void )
{
volatile unsigned int a;
unsigned int b;
a = 1;
b = 1;
fun3(a+b);
}
0000004c <fun5>:
4c: 2301 movs r3, #1
4e: b082 sub sp, #8
50: 9300 str r3, [sp, #0]
52: 2300 movs r3, #0
54: 9a00 ldr r2, [sp, #0]
56: 9301 str r3, [sp, #4]
58: 9b01 ldr r3, [sp, #4]
5a: 3201 adds r2, #1
5c: 429a cmp r2, r3
5e: d905 bls.n 6c <fun5+0x20>
60: 9b01 ldr r3, [sp, #4]
62: 3301 adds r3, #1
64: 9301 str r3, [sp, #4]
66: 9b01 ldr r3, [sp, #4]
68: 429a cmp r2, r3
6a: d8f9 bhi.n 60 <fun5+0x14>
6c: 9b01 ldr r3, [sp, #4]
6e: b002 add sp, #8
70: 4770 bx lr
Also fun3 is inlined, but the a variable is read from memory every time
instead of being optimized out
58: 9b01 ldr r3, [sp, #4]
5a: 3201 adds r2, #1
void fun6 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
fun3(i);
}
}
00000074 <fun6>:
74: 2300 movs r3, #0
76: 2200 movs r2, #0
78: 2100 movs r1, #0
7a: b082 sub sp, #8
7c: 9301 str r3, [sp, #4]
7e: 9b01 ldr r3, [sp, #4]
80: 3201 adds r2, #1
82: 9b01 ldr r3, [sp, #4]
84: 2a05 cmp r2, #5
86: d00d beq.n a4 <fun6+0x30>
88: 9101 str r1, [sp, #4]
8a: 9b01 ldr r3, [sp, #4]
8c: 4293 cmp r3, r2
8e: d2f7 bcs.n 80 <fun6+0xc>
90: 9b01 ldr r3, [sp, #4]
92: 3301 adds r3, #1
94: 9301 str r3, [sp, #4]
96: 9b01 ldr r3, [sp, #4]
98: 429a cmp r2, r3
9a: d8f9 bhi.n 90 <fun6+0x1c>
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
a2: d1f1 bne.n 88 <fun6+0x14>
a4: b002 add sp, #8
a6: 4770 bx lr
This one I found interesting, could have been optimized better, based on my gnu experience kind of confused, but as pointed out, this is how it is, you can expect one thing but the compiler does what it does.
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
The i variable in the fun6 function is put on the stack for some reason, it is not volatile it does not desire that kind of access every time. But that is how they implemented it.
If I build with an older version of gcc I see this
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
Another thing to note is that gnu at least is not getting better every version, it has been at times getting worse, this is a simple case.
void fun7 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
fun2(i);
}
}
0000013c <fun7>:
13c: e12fff1e bx lr
Okay too extreme (no surprise in the result), let us try this
void more_fun ( unsigned int );
void fun8 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
more_fun(i);
}
}
000000ac <fun8>:
ac: b510 push {r4, lr}
ae: 2000 movs r0, #0
b0: f7ff fffe bl 0 <more_fun>
b4: 2001 movs r0, #1
b6: f7ff fffe bl 0 <more_fun>
ba: 2002 movs r0, #2
bc: f7ff fffe bl 0 <more_fun>
c0: 2003 movs r0, #3
c2: f7ff fffe bl 0 <more_fun>
c6: 2004 movs r0, #4
c8: f7ff fffe bl 0 <more_fun>
cc: bd10 pop {r4, pc}
ce: 46c0 nop ; (mov r8, r8)
No surprise there it chose to unroll it because 5 is below some threshold.
void fun9 ( unsigned int len )
{
unsigned int i;
for(i=0; i < len; i++)
{
more_fun(i);
}
}
000000d0 <fun9>:
d0: b570 push {r4, r5, r6, lr}
d2: 1e05 subs r5, r0, #0
d4: d006 beq.n e4 <fun9+0x14>
d6: 2400 movs r4, #0
d8: 0020 movs r0, r4
da: 3401 adds r4, #1
dc: f7ff fffe bl 0 <more_fun>
e0: 42a5 cmp r5, r4
e2: d1f9 bne.n d8 <fun9+0x8>
e4: bd70 pop {r4, r5, r6, pc}
That is what I was looking for. So in this case the i variable is in a register (r4) not on the stack as shown above. The calling convention for this says r4 and some number of others after it (r5,r6,...) must be preserved. This is calling an external function which the optimizer can't see, so it has to implement the loop so that the function is called that many times with each of the values in order. Not dead code.
Textbook/classroom implies that local variables are on the stack, but they do not have to be. i is not declared volatile so instead take a non-volatile register, r4 save that on the stack so the caller does not lose its state, use r4 as i and the callee function more_fun either will not touch it or will return it as it found it. You add a push, but save a bunch of loads and stores in the loop, yet another optimization based on the target and the ABI.
Volatile is a suggestion/recommendation/desire to the compiler that it have an address for the variable and perform actual load and store accesses to that variable when used. Ideally for use cases like when you have a control/status register in a peripheral in hardware that you need all of the accesses described in the code to happen in the order coded, no optimization. As to a cache that is independent of the language you have to setup the cache and the mmu or other solution so that control and status registers do not get cached and the peripheral is not touched when we wanted it to be touched. Takes both layers you need to tell the compiler to do all the accesses and need to not block those accesses in the memory system.
Without volatile and based on the command line options you use and the list of optimizations the compiler has been programmed to attempt to perform the compiler will try to perform those optimizations as they are programmed in the compilers code. If the compiler can't see into a calling function like more_fun above because it is not in this optimization domain then the compiler must functionally represent all the calls in order, if it can see and inlining is allowed then the compiler can if programmed to do so essentially pull the function inline with the caller THEN optimize that whole blob as if it were one function based on other available options. Not uncommon to have the callee function be bulky because of its nature, but when specific values are passed by a caller and the compiler can see all of it the caller plus callee code can be smaller than the callee implementation.
You will often see folks wanting to for example learn assembly language by examining the output of a compiler do something like this:
void fun10 ( void )
{
int a;
int b;
int c;
a = 5;
b = 6;
c = a + b;
}
not realizing that that is dead code and should be optimized out if an optimizer is used, they ask a Stack Overflow question and someone says you need to turn the optimizer off, now you get a lot of loads and stores have to understand and keep track of stack offsets and while it is valid asm code you can study it is not what you were hoping for, instead something like this is more valuable to that effort
unsigned int fun11 ( unsigned int a, unsigned int b )
{
return(a+b);
}
The inputs are unknown to the compiler and a return value is required so it can't dead code this it has to implement it.
And this is a simple case of demonstrating the caller plus callee is smaller than the callee
000000ec <fun11>:
ec: 1840 adds r0, r0, r1
ee: 4770 bx lr
000000f0 <fun12>:
f0: 2007 movs r0, #7
f2: 4770 bx lr
While that may not look simpler it has inlined the code, it has optimized out the a = 3, b = 4 assignments, optimized out the addition operation and simply pre-computed the result and returned it.
Certainly with gcc you can cherry pick the optimizations you want to add or block there is a laundry list of them that you can go research.
With very little practice you can see what is optimizable at least within the view of the function but then hope the compiler figures it out. Certainly visualizing inline takes more work but really it is the same you just visually inline it.
Now there are ways with gnu and llvm to optimize across files, basically whole project so more_fun would be visible now and the functions that call it might get further optimized than what you see in the object of the one file with the caller. Takes certain command lines on the compile and/or link for this to work and I have not memorized them. With llvm there is a way to merge bytecode and then optimize that, but it does not always do what you hoped it would do as far as a whole project optimization.

ARM Cortex M7 unaligned access and memcpy

I am compiling this code for a Cortex M7 using GCC:
// copy manually
void write_test_plain(uint8_t * ptr, uint32_t value)
{
*ptr++ = (u8)(value);
*ptr++ = (u8)(value >> 8);
*ptr++ = (u8)(value >> 16);
*ptr++ = (u8)(value >> 24);
}
// copy using memcpy
void write_test_memcpy(uint8_t * ptr, uint32_t value)
{
void *px = (void*)&value;
memcpy(ptr, px, 4);
}
int main(void)
{
extern uint8_t data[];
extern uint32_t value;
// i added some offsets to data to
// make sure the compiler cannot
// assume it's aligned in memory
write_test_plain(data + 2, value);
__asm volatile("": : :"memory"); // just to split inlined calls
write_test_memcpy(data + 5, value);
... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? This looks like inlined 32-bit store to the destination address, but isn't this a problem since data + 5 is most certainly not aligned to a 4-byte boundary?
Is this perhaps some optimization which happens due to some undefined behavior in my source?

For Cortex-M processors unaligned loads and stores of bytes, half-words, and words are usually allowed and most compilers use this when generating code unless they are instructed not to. If you want to prevent gcc from assuming the unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag gcc will no longer inline the call to memcpy and write_test_memcpy looks like
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4

Cortex-M 7 , M4, M3 M33, M23 does support unaligned access
M0, M+ doesn't support unaligned access
however you can disable the support of unaligned access in cortexm7 by setting bit UNALIGN_TRP in configuration and control register and any unaligned access will generate usage fault.
From compiler perspective, default setting is that generated assembly code does unaligned access unless you disable this by using the compile flag -mno-unaligned-access

Why the interrupt service routine ,PUSH {r3,r4,r5,lr} but POP {r0,r4,r5,lr},which lead to ERROR?

I am using IAR to compile routines, but run error on ARM A7; then i got the question below when i open the .lst file generated by IAR.
It is a ISR, first push {r3, r4, r5, lr}, but POP {r0, r4, r5, lr} when return, the R0 value is changed to the value of R3 before push. So R0 is wrong when returned from irqHandler which lead to error in follow routines.
why ?
void irqHandler(void)
{
878: e92d4038 push {r3, r4, r5, lr}
volatile u32 *pt = (u32 *)AM_INTC_BASE;
87c: e3a044b0 mov r4, #176, 8 ; 0xb0000000
u32 id_spin;
id_spin = *(pt+0x200c/4) & 0x3ff;
880: e302000c movw r0, #8204 ; 0x200c
884: e7900004 ldr r0, [r0, r4]
888: e1b00b00 lsls r0, r0, #22
88c: e1b00b20 lsrs r0, r0, #22
890: e1b05000 movs r5, r0
if(id_spin<32)
894: e3550020 cmp r5, #32
898: 2a000000 bcs 8a0 <irqHandler+0x28>
{
#ifdef WHOLECHIPSIM
print("id_spid<32 error...\r\n",0);
#endif
while(1);
89c: eafffffe b 89c <irqHandler+0x24>
}
else
{
(pFuncIrq[id_spin-32])();
8a0: e59f0010 ldr r0, [pc, #16] ; 8b8 <.text_8>
8a4: e1b01105 lsls r1, r5, #2
8a8: e0910000 adds r0, r1, r0
8ac: e5100080 ldr r0, [r0, #-128] ; 0x80
8b0: e12fff30 blx r0
}
}
8b4: e8bd8031 pop {r0, r4, r5, pc}

The abi requires a 64 bit aligned stack, so the push of r3 simply facilitates that. Could have chosen any register not already specified. Likewise on the pop they need to clean up the stack the function is prototyped as void so the return (r0) is a dont care and r0-r3 are not expected to be preserved so no reason to match the r3 on each end nor match an r0 on each end.
had they chose a register numbered above r3 (r6 for example) on the push then that would have needed to be matched on the pop. Otherwise the pop would have to be one of r0-r3 to not trash a non-volatile register. (couldnt push r3 then pop r6 that would trash r6)

It does not matter as R0-R3, R12, LR, PC, xPSR are saved on the stack automaticly when the hardware invokes the interrupt vector routine. When bx, ldm, pop, or ldr with PC is invoked hardware executes interrupt routine exit poping those registers.
Do not check your compiler. It knows what it does. Check tour wrong logic - especially printing strings in the interrupt handler.

assemble code with the keyword __irq __arm is below:
__irq __arm void irqHandler(void)
{
878: e24ee004 sub lr, lr, #4
87c: e92d503f push {r0, r1, r2, r3, r4, r5, ip, lr}
volatile u32 *pt = (u32 *)AM_INTC_BASE;
880: e3a044b0 mov r4, #176, 8 ; 0xb0000000
u32 id_spin;
id_spin = *(pt+0x200c/4) & 0x3ff;
884: e302000c movw r0, #8204 ; 0x200c
888: e7900004 ldr r0, [r0, r4]
88c: e1b00b00 lsls r0, r0, #22
890: e1b00b20 lsrs r0, r0, #22
894: e1b05000 movs r5, r0
if(id_spin<32)
898: e3550020 cmp r5, #32
89c: 2a000000 bcs 8a4 <irqHandler+0x2c>
{
#ifdef WHOLECHIPSIM
print("id_spid<32 error...\r\n",0);
#endif
while(1);
8a0: eafffffe b 8a0 <irqHandler+0x28>
}
else
{
(pFuncIrq[id_spin-32])();
8a4: e59f0010 ldr r0, [pc, #16] ; 8bc <.text_8>
8a8: e1b01105 lsls r1, r5, #2
8ac: e0910000 adds r0, r1, r0
8b0: e5100080 ldr r0, [r0, #-128] ; 0x80
8b4: e12fff30 blx r0
}
}
8b8: e8fd903f ldm sp!, {r0, r1, r2, r3, r4, r5, ip, pc}^

Cortex A7 PUSH log ,it just push 7 register, so 32bit aligned is ok
follow link is the log info:
http://img.blog.csdn.net/20170819120758443?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcmFpbmJvd2JpcmRzX2Flcw==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center

LPC810 - why won't machine code execute from an array of uint8_t in flash?

I'm writing embedded C/assembler code for the NXP LPC810 microcontroller (just a hobby project).
I have a function fn. I also have an exact copy of that function's machine code in an array of uint8_t. (I have checked the hex file.)
I create a function pointer fnptr, with the same type as fn and point it at the array, using a cast.
It all cross-compiles without warnings.
When the MCU executes fn it works correctly.
When the MCU executes fnptr it crashes (I can't see any debug, as there are only 8 pins, all in use).
The code is position independent.
The array has the correct 4 byte alignment.
fn is in the .text section of the elf file.
The array is forced into the .text section of the elf file (still in flash, not RAM).
I have assumed that there is no NX-like functionality on such a basic Coretex M0+ MCU. (Cortex M3 and M4 do have some form of read-only memory protection for code.)
Are there other reasons why the machine code in the array does not work?
Update:
Here is the code:
#include "stdio.h"
#include "serial.h"
extern "C" void SysTick_Handler() {
// generate an interrupt for delay
}
void delay(int millis) {
while (--millis >= 0) {
__WFI(); // wait for SysTick interrupt
}
}
extern "C" int fn(int a, int b) {
return a + b;
}
/* arm-none-eabi-objdump -d firmware.elf
00000162 <fn>:
162: 1840 adds r0, r0, r1
164: 4770 bx lr
166: 46c0 nop ; (mov r8, r8)
*/
extern "C" const uint8_t machine_code[6] __attribute__((aligned (4))) __attribute__((section (".text"))) = {
0x40,0x18,
0x70,0x47,
0xc0,0x46
};
int main() {
LPC_SWM->PINASSIGN0 = 0xFFFFFF04UL;
serial.init(LPC_USART0, 115200);
SysTick_Config(12000000/1000); // 1ms ticks
int(*fnptr)(int a, int b) = (int(*)(int, int))machine_code;
for (int a = 0; ; a++) {
int c = fnptr(a, 1000000);
printf("Hello world2 %d.\n", c);
delay(1000);
}
}
And here is the disassembled output from arm-none-eabi-objdump -D -Mforce-thumb firmware.elf:
00000162 <fn>:
162: 1840 adds r0, r0, r1
164: 4770 bx lr
166: 46c0 nop ; (mov r8, r8)
00000168 <machine_code>:
168: 1840 adds r0, r0, r1
16a: 4770 bx lr
16c: 46c0 nop ; (mov r8, r8)
16e: 46c0 nop ; (mov r8, r8)
00000170 <main>:
...

I amended the code to call the original fn though a function pointer too, in order to be able to generate working and non-working assembly code that was hopefully near-identical.
machine_code has become much longer, as I am now using no optimisation (-O0).
#include "stdio.h"
#include "serial.h"
extern "C" void SysTick_Handler() {
// generate an interrupt for delay
}
void delay(int millis) {
while (--millis >= 0) {
__WFI(); // wait for SysTick interrupt
}
}
extern "C" int fn(int a, int b) {
return a + b;
}
/*
000002bc <fn>:
2bc: b580 push {r7, lr}
2be: b082 sub sp, #8
2c0: af00 add r7, sp, #0
2c2: 6078 str r0, [r7, #4]
2c4: 6039 str r1, [r7, #0]
2c6: 687a ldr r2, [r7, #4]
2c8: 683b ldr r3, [r7, #0]
2ca: 18d3 adds r3, r2, r3
2cc: 1c18 adds r0, r3, #0
2ce: 46bd mov sp, r7
2d0: b002 add sp, #8
2d2: bd80 pop {r7, pc}
*/
extern "C" const uint8_t machine_code[24] __attribute__((aligned (4))) __attribute__((section (".text"))) = {
0x80,0xb5,
0x82,0xb0,
0x00,0xaf,
0x78,0x60,
0x39,0x60,
0x7a,0x68,
0x3b,0x68,
0xd3,0x18,
0x18,0x1c,
0xbd,0x46,
0x02,0xb0,
0x80,0xbd
};
int main() {
LPC_SWM->PINASSIGN0 = 0xFFFFFF04UL;
serial.init(LPC_USART0, 115200);
SysTick_Config(12000000/1000); // 1ms ticks
int(*fnptr)(int a, int b) = (int(*)(int, int))fn;
//int(*fnptr)(int a, int b) = (int(*)(int, int))machine_code;
for (int a = 0; ; a++) {
int c = fnptr(a, 1000000);
printf("Hello world2 %d.\n", c);
delay(1000);
}
}
I compiled the code above, generating firmware.fn.elf and firmware.machinecode.elf by uncommenting //int(*fnptr)(int a, int b) = (int(*)(int, int))machine_code; (and commenting-out the line above).
The first code (fn) worked, the second code (machine_code) crashed.
fn's text and the code at machine_code are identical:
000002bc <fn>:
2bc: b580 push {r7, lr}
2be: b082 sub sp, #8
2c0: af00 add r7, sp, #0
2c2: 6078 str r0, [r7, #4]
2c4: 6039 str r1, [r7, #0]
2c6: 687a ldr r2, [r7, #4]
2c8: 683b ldr r3, [r7, #0]
2ca: 18d3 adds r3, r2, r3
2cc: 1c18 adds r0, r3, #0
2ce: 46bd mov sp, r7
2d0: b002 add sp, #8
2d2: bd80 pop {r7, pc}
000002d4 <machine_code>:
2d4: b580 push {r7, lr}
2d6: b082 sub sp, #8
2d8: af00 add r7, sp, #0
2da: 6078 str r0, [r7, #4]
2dc: 6039 str r1, [r7, #0]
2de: 687a ldr r2, [r7, #4]
2e0: 683b ldr r3, [r7, #0]
2e2: 18d3 adds r3, r2, r3
2e4: 1c18 adds r0, r3, #0
2e6: 46bd mov sp, r7
2e8: b002 add sp, #8
2ea: bd80 pop {r7, pc}
000002ec <main>:
...
The only difference in the calling code is the location of the code called:
$ diff firmware.fn.bin.xxd firmware.machine_code.bin.xxd
54c54
< 0000350: 0040 0640 e02e 0000 bd02 0000 4042 0f00 .#.#........#B..
---
> 0000350: 0040 0640 e02e 0000 d402 0000 4042 0f00 .#.#........#B..
The second address d402 is the address of the machine_code array.
Curiously, the first address bd02 is a little-endian odd number (d is odd in hex).
The address of fn is 02bc (bc02 in big endian), so the pointer to fn is not the address of fn, but the address of fn plus one (or with the low bit set).
Changing the code to:
...
int main() {
LPC_SWM->PINASSIGN0 = 0xFFFFFF04UL;
serial.init(LPC_USART0, 115200);
SysTick_Config(12000000/1000); // 1ms ticks
//int(*fnptr)(int a, int b) = (int(*)(int, int))fn;
int machine_code_addr_low_bit_set = (int)machine_code | 1;
int(*fnptr)(int a, int b) = (int(*)(int, int))machine_code_addr_low_bit_set;
for (int a = 0; ; a++) {
int c = fnptr(a, 1000000);
printf("Hello world2 %d.\n", c);
delay(1000);
}
}
Makes it work.
Googling, I found:
The mechanism for switching makes use of the fact that all instructions must be (at least) halfword-aligned, which means that bit[0] of the branch target address is redundant. Therefore this bit can be re-used to indicate the target instruction set at that address. Bit[0] cleared to 0 means ARM and bit[0] set to 1 means Thumb.
on http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka12545.html
tl;dr
You need to set the low bit on function pointers when executing data as code on ARM Thumb.

Optimize C or assembly code in size for Cortex-M0

I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the ROM data has to be copied to the RAM data once. Therefore I have this piece of code:
void __startup( void ){
extern unsigned int __data_init_start;
extern unsigned int __data_start;
extern unsigned int __data_end;
// copy .data section from flash to ram
s = & __data_init_start;
d = & __data_start;
e = & __data_end;
while( d != e ){
*d++ = *s++;
}
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?

The compiler (or you!) does not realize that the range to copy is end - start. There seems to be some unnecessarily shuffling of data going on -- the 2 add and the sub in the loop. Also, it seems to me the compiler makes sure that the number of copies to make is a multiple of 4. An obvious optimization, then, is to make sure it is in advance! Below I assume it is (if not, the bne will fail and happily keep on copying and trample all over your memory).
Using my decade-old ARM assembler knowlegde (yes, that is a major disclaimer), and post-incrementing, I think the following short snippet is what it can be condensed to. From 18 instructions down to 8, not too bad. If it works.
ldr r1, __data_init_start
ldr r2, __data_start
ldr r3, __data_end
sub r4, r3, r2
.L1:
ldr r3, [r1], #4 ; safe to re-use r3 here
str r3, [r2], #4
subs r4, r4, #4
bne L1

May be that platform guarantees that writing to an unsigned int * you may change an unsigned int * value (i.e. it doesn't take advantage of type mismatch aliasing rules).
Then the code is inefficient because e is a global variable and the generated code logic must take in account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local that never had its address taken is not possible from a C point of view).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Data dependency detection bug with arm-none-eabi-gcc and -Ofast - c

Related

How do i know if the compiler will optimize a variable?

ARM Cortex M7 unaligned access and memcpy

Why the interrupt service routine ,PUSH {r3,r4,r5,lr} but POP {r0,r4,r5,lr},which lead to ERROR?

LPC810 - why won't machine code execute from an array of uint8_t in flash?

Optimize C or assembly code in size for Cortex-M0

Categories

Resources