C - occasional CPU stall during memcmp on Cortex-R5

C - occasional CPU stall during memcmp on Cortex-R5 - c

I'm running some tests on a Cortex-R5 (Ultrascale MpSoC). It basically generates 2 random numbers with a hardware module and compares them at the end to ensure they're not 0, nor the same values.
uint32_t status;
const uint8_t zeros[32] = {0};
uint8_t bytes1[32] = {0};
uint8_t bytes2[32] = {0};
// (generate random numbers and put them in bytes1)
// (generate random numbers and put them in bytes2)
printf("memcmp 0\n");
status = !memcmp(bytes1, bytes2, 32);
printf("memcmp 1\n");
status |= !memcmp(bytes1, zeros, 32);
printf("memcmp 2\n");
status |= !memcmp(bytes2, zeros, 32);
Some tests are running fine. Some executions are stalled after printing "memcmp 0" (when it freezes, it's always at the first memcmp)...
I have tried several things:
When I print the values in bytes1 and 2, they are indeed random numbers not equal to 0 and not equal with each other.
Moving the memcmp at different places, or switching the memcmp's. It's always the first one which freezes.
Replacing memcmp with a custom function to do comparison => it never freezes.
The memcmp function is used at other places of the code and it freezes nowhere else. Perhaps the difference is that the random check is the only place where the memcmp expects different values (at other places it's to ensure a function produces expected output).
I couldn't find the definition of memcmp... I don't know where to look. The only thing I could find is the assembly code, but it'd be difficult to attach a debugger to know exactly which instruction can't complete.
000064d0 <memcmp>:
64d0: 2a03 cmp r2, #3
64d2: b470 push {r4, r5, r6}
64d4: d912 bls.n 64fc <memcmp+0x2c>
64d6: ea40 0501 orr.w r5, r0, r1
64da: 4604 mov r4, r0
64dc: 07ad lsls r5, r5, #30
64de: 460b mov r3, r1
64e0: d120 bne.n 6524 <memcmp+0x54>
64e2: 681d ldr r5, [r3, #0]
64e4: 4619 mov r1, r3
64e6: 6826 ldr r6, [r4, #0]
64e8: 4620 mov r0, r4
64ea: 3304 adds r3, #4
64ec: 3404 adds r4, #4
64ee: 42ae cmp r6, r5
64f0: d118 bne.n 6524 <memcmp+0x54>
64f2: 3a04 subs r2, #4
64f4: 4620 mov r0, r4
64f6: 2a03 cmp r2, #3
64f8: 4619 mov r1, r3
64fa: d8f2 bhi.n 64e2 <memcmp+0x12>
64fc: 1e54 subs r4, r2, #1
64fe: b172 cbz r2, 651e <memcmp+0x4e>
6500: 7802 ldrb r2, [r0, #0]
6502: 780b ldrb r3, [r1, #0]
6504: 429a cmp r2, r3
6506: bf08 it eq
6508: 1864 addeq r4, r4, r1
650a: d006 beq.n 651a <memcmp+0x4a>
650c: e00c b.n 6528 <memcmp+0x58>
650e: f810 2f01 ldrb.w r2, [r0, #1]!
6512: f811 3f01 ldrb.w r3, [r1, #1]!
6516: 429a cmp r2, r3
6518: d106 bne.n 6528 <memcmp+0x58>
651a: 42a1 cmp r1, r4
651c: d1f7 bne.n 650e <memcmp+0x3e>
651e: 2000 movs r0, #0
6520: bc70 pop {r4, r5, r6}
6522: 4770 bx lr
6524: 1e54 subs r4, r2, #1
6526: e7eb b.n 6500 <memcmp+0x30>
6528: 1ad0 subs r0, r2, r3
652a: bc70 pop {r4, r5, r6}
652c: 4770 bx lr
652e: bf00 nop
Where can I see the source code of memcmp for cortex R5? FYI, the used compiler is armr5-none-eabi-gcc.
Any idea what could cause a CPU stall with this function?
Thank you

Related

ARM64 Backtrace from link register

I am currently trying to get backtrace based on stack pointer and link register on ARM64 device using C program.
Below is example of objdump
bar() calls foo() with 240444: ebfffd68 bl 23f9ec <foo##Base>
I can get link register (lr) and from that getting 23f9ec, save it to backtrace list as last routine.
My question: From below assembly code with current lr 0023f9ec <foo##Base>:, how to calculate to get previous routine with lr is 0023fe14 <bar##Base> using C language?
here is my implementation, but getting wrong previous lr
int bt(void** backtrace, int max_size) {
unsigned long* sp = __get_SP();
unsigned long* ra = __get_LR();
int* funcbase = (int*)(int)&bt;
int spofft = (short)((*funcbase));
sp = (char*)sp-spofft;
unsigned long* wra = (unsigned long*)ra;
int spofft;
int depth = 0;
while(ra) {
wra = ra;
while((*wra >> 16) != 0xe92d) {
wra--;
}
if(wra == 0)
return 0;
spofft = (short)(*wra & 0xffff);
if(depth < max_size)
backtrace[depth] = ra;
else
break;
ra =(unsigned long *)((unsigned long)ra + spofft);
sp =(unsigned long *)((unsigned long)sp + spofft);
depth++;
}
return 1;
}
0023f9ec <foo##Base>:
23f9ec: e92d42f3 push {r0, r1, r4, r5, r6, r7, r9, lr}
23f9f0: e1a09001 mov r9, r1
23f9f4: e1a07000 mov r7, r0
23f9f8: ebfffff9 bl 23f9e4 <__get_SP##Base>
23f9fc: e59f4060 ldr r4, [pc, #96] ; 23fa64 <foo##Base+0x78>
23fa00: e08f4004 add r4, pc, r4
23fa04: e1a05000 mov r5, r0
23fa08: ebfffff3 bl 23f9dc <__get_LR##Base>
23fa0c: e59f3054 ldr r3, [pc, #84] ; 23fa68 <foo##Base+0x7c>
23fa10: e3002256 movw r2, #598 ; 0x256
23fa14: e59f1050 ldr r1, [pc, #80] ; 23fa6c <foo##Base+0x80>
23fa18: e7943003 ldr r3, [r4, r3]
23fa1c: e08f1001 add r1, pc, r1
23fa20: e5934000 ldr r4, [r3]
23fa24: e1a03005 mov r3, r5
23fa28: e6bf4074 sxth r4, r4
23fa2c: e58d4004 str r4, [sp, #4]
23fa30: e1a06000 mov r6, r0
23fa34: e58d0000 str r0, [sp]
23fa38: e59f0030 ldr r0, [pc, #48] ; 23fa70 <foo##Base+0x84>
23fa3c: e08f0000 add r0, pc, r0
23fa40: ebfd456d bl 190ffc <printf#plt>
23fa44: e1a03009 mov r3, r9
23fa48: e1a02007 mov r2, r7
23fa4c: e1a01006 mov r1, r6
23fa50: e0640005 rsb r0, r4, r5
23fa54: ebffff70 bl 23f81c <get_prev_sp_ra2##Base>
23fa58: e3a00000 mov r0, #0
23fa5c: e28dd008 add sp, sp, #8
23fa60: e8bd82f0 pop {r4, r5, r6, r7, r9, pc}
23fa64: 003d5be0 eorseq r5, sp, r0, ror #23
23fa68: 000026c8 andeq r2, r0, r8, asr #13
23fa6c: 002b7ba6 eoreq r7, fp, r6, lsr #23
23fa70: 002b73e5 eoreq r7, fp, r5, ror #7
0023fe14 <bar##Base>:
23fe14: e92d4ef0 push {r4, r5, r6, r7, r9, sl, fp, lr}
23fe18: e24dde16 sub sp, sp, #352 ; 0x160
23fe1c: e59f76a8 ldr r7, [pc, #1704] ; 2404cc <bar##Base+0x6b8>
23fe20: e1a04000 mov r4, r0
23fe24: e59f66a4 ldr r6, [pc, #1700] ; 2404d0 <bar##Base+0x6bc>
23fe28: e1a03000 mov r3, r0
23fe2c: e59f26a0 ldr r2, [pc, #1696] ; 2404d4 <bar##Base+0x6c0>
23fe30: e08f7007 add r7, pc, r7
23fe34: e08f6006 add r6, pc, r6
23fe38: e3a00000 mov r0, #0
23fe3c: e08f2002 add r2, pc, r2
23fe40: e1a05001 mov r5, r1
23fe44: e3a01003 mov r1, #3
23fe48: e59f9688 ldr r9, [pc, #1672] ; 2404d8 <bar##Base+0x6c4>
.....................................................................
24043c: e3a0100f mov r1, #15
240440: e1a0000a mov r0, sl
240444: ebfffd68 bl 23f9ec <foo##Base>
240448: e59f2108 ldr r2, [pc, #264] ; 240558 <bar##Base+0x744>
24044c: e3a01003 mov r1, #3
240450: e08f2002 add r2, pc, r2
240454: e1a05000 mov r5, r0
240458: e1a03000 mov r3, r0
24045c: e3a00000 mov r0, #0

I don't think there's an easy way to do this.
Normally the register ABI of any operating system contains a "frame pointer" register. For example, on Apple's armv7 ABI, this is r7:
0x10006fc0 b0b5 push {r4, r5, r7, lr}
0x10006fc2 02af add r7, sp, 8
0x10006fc4 0448 ldr r0, [0x10006fd8]
0x10006fc6 d0e90c45 ldrd r4, r5, [r0, 0x30]
0x10006fca 0020 movs r0, 0
0x10006fcc fff7a6ff bl 0x10006f1c
0x10006fd0 0019 adds r0, r0, r4
0x10006fd2 6941 adcs r1, r5
0x10006fd4 b0bd pop {r4, r5, r7, pc}
If you dereference r7 there, you get to a pair of pointers, the second of which is lr, and the first of which is the r7 of the calling function, allowing you to repeat this process until you reach the bottom of the stack.
Judging by the assembly you posted, the codebase you're looking at doesn't have that. This means that the only way to obtain the return address is the same way that the code itself does: step forward through each instruction and parse/interpret them until you reach something that loads into pc. This is of course imperfect, since there may be functions in your call stack that do not ever return, but there's not much you can do about that.
It may be tempting to search backwards instead, and while you can do a heuristic approach and probably reach quite reasonable results with it, that is even less reliable than searching forward, since you have absolutely no way of telling whether you arrived at address X by stepping forward from the previous instruction or by explicitly jumping there from somewhere else.

Self written simple memset not working with -03 eabi gcc on ARMv7

I wrote a very simple memset in c that works fine up to -O2 but not with -O3...
memset:
void * memset(void * blk, int c, size_t n)
{
unsigned char * dst = blk;
while (n-- > 0)
*dst++ = (unsigned char)c;
return blk;
}
...which compiles to this assembly when using -O2:
20000430 <memset>:
20000430: e3520000 cmp r2, #0 # compare param 'n' with zero
20000434: 012fff1e bxeq lr # if equal return to caller
20000438: e6ef1071 uxtb r1, r1 # else zero extend (extract byte from) param 'c'
2000043c: e0802002 add r2, r0, r2 # add pointer 'blk' to 'n'
20000440: e1a03000 mov r3, r0 # move pointer 'blk' to r3
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
20000450: e12fff1e bx lr # else back to caller
This makes sense to me. I annotated what happens here.
When I compile it with -O3 the program crashes. My memset calls itself repeatedly until it ate the whole stack:
200005e4 <memset>:
200005e4: e3520000 cmp r2, #0 # compare param 'n' with zero
200005e8: e92d4010 push {r4, lr} # ? (1)
200005ec: e1a04000 mov r4, r0 # move pointer 'blk' to r4 (temp to hold return value)
200005f0: 0a000001 beq 200005fc <memset+0x18> # if equal (first line compare) jump to epilogue
200005f4: e6ef1071 uxtb r1, r1 # zero extend (extract byte from) param 'c'
200005f8: ebfffff9 bl 200005e4 <memset> # call myself ? (2)
200005fc: e1a00004 mov r0, r4 # epilogue start. move return value to r0
20000600: e8bd8010 pop {r4, pc} # restore r4 and back to caller
I can't figure out how this optimised version is supposed to work without any strb or similar. It doesn't matter if I try to set the memory to '0' or something else so the function is not only called on .bss (zero initialised) variables.
(1) This is a problem. This push gets endlessly repeated without a matching pop as it's called by (2) when the function doesn't early-exit because of 'n' being zero. I verified this with uart prints. Also r2 is never touched so why should the compare to zero ever become true?
Please help me understand what's happening here. Is the compiler assuming prerequisites that I may not fulfill?
Background: I'm using external code that requires memset in my baremetal project so I rolled my own. It's only used once on startup and not performance critical.
/edit: The compiler is called with these options:
arm-none-eabi-gcc -O3 -Wall -Wextra -fPIC -nostdlib -nostartfiles -marm -fstrict-volatile-bitfields -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv3

Your first question (1). That is per the calling convention if you are going to make a nested function call you need to preserve the link register, and you need to be 64 bit aligned. The code uses r4 so that is the extra register saved. No magic there.
Your second question (2) it is not calling your memset it is optimizing your code because it sees it as an inefficient memset. Fuz has provided the answers to your question.
Rename the function
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: e92d4010 push {r4, lr}
8: e1a04000 mov r4, r0
c: 0a000001 beq 18 <xmemset+0x18>
10: e6ef1071 uxtb r1, r1
14: ebfffffe bl 0 <memset>
18: e1a00004 mov r0, r4
1c: e8bd8010 pop {r4, pc}
and you can see this.
If you were to use -ffreestanding as Fuz recommended then you see this or something like it
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: 012fff1e bxeq lr
8: e92d41f0 push {r4, r5, r6, r7, r8, lr}
c: e2426001 sub r6, r2, #1
10: e3560002 cmp r6, #2
14: e6efe071 uxtb lr, r1
18: 9a00002a bls c8 <xmemset+0xc8>
1c: e3a0c000 mov r12, #0
20: e3520023 cmp r2, #35 ; 0x23
24: e7c7c01e bfi r12, lr, #0, #8
28: e1a04122 lsr r4, r2, #2
2c: e7cfc41e bfi r12, lr, #8, #8
30: e7d7c81e bfi r12, lr, #16, #8
34: e7dfcc1e bfi r12, lr, #24, #8
38: 9a000024 bls d0 <xmemset+0xd0>
3c: e2445009 sub r5, r4, #9
40: e1a03000 mov r3, r0
44: e3c55007 bic r5, r5, #7
48: e3a07000 mov r7, #0
4c: e2851008 add r1, r5, #8
50: e1570005 cmp r7, r5
54: f5d3f0a0 pld [r3, #160] ; 0xa0
58: e1a08007 mov r8, r7
5c: e583c000 str r12, [r3]
60: e583c004 str r12, [r3, #4]
64: e2877008 add r7, r7, #8
68: e583c008 str r12, [r3, #8]
6c: e2833020 add r3, r3, #32
70: e503c014 str r12, [r3, #-20] ; 0xffffffec
74: e503c010 str r12, [r3, #-16]
78: e503c00c str r12, [r3, #-12]
7c: e503c008 str r12, [r3, #-8]
80: e503c004 str r12, [r3, #-4]
84: 1afffff1 bne 50 <xmemset+0x50>
88: e2811001 add r1, r1, #1
8c: e483c004 str r12, [r3], #4
90: e1540001 cmp r4, r1
94: 8afffffb bhi 88 <xmemset+0x88>
98: e3c23003 bic r3, r2, #3
9c: e1520003 cmp r2, r3
a0: e0466003 sub r6, r6, r3
a4: e0803003 add r3, r0, r3
a8: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
ac: e3560000 cmp r6, #0
b0: e5c3e000 strb lr, [r3]
b4: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
b8: e3560001 cmp r6, #1
bc: e5c3e001 strb lr, [r3, #1]
c0: 15c3e002 strbne lr, [r3, #2]
c4: e8bd81f0 pop {r4, r5, r6, r7, r8, pc}
c8: e1a03000 mov r3, r0
cc: eafffff6 b ac <xmemset+0xac>
d0: e1a03000 mov r3, r0
d4: e3a01000 mov r1, #0
d8: eaffffea b 88 <xmemset+0x88>
which appears like it simply inlined memset, the one it knows not your code (the faster one).
So if you want it to use your code then stick with -O2. Yours is pretty inefficient so not sure why you need to push it any further than it was.
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
It isn't going to get any better than that without replacing your code with something else.
Fuz already answered the question:
Compile with -fno-builtin-memset. The compiler recognises that the function implements memset and thus replaces it with a call to memset. You should in general compile with -ffreestanding when writing bare-metal code. I believe this fixes this sort of problem, too
It is replacing your code with memset, if you want it not to do that use -ffreestanding.
If you wish to go beyond that and wonder why -fno-builtin-memset didn't work that is a question for the gcc folks, file a ticket, let us know what they say (or just look at the compiler source code).

Why the interrupt service routine ,PUSH {r3,r4,r5,lr} but POP {r0,r4,r5,lr},which lead to ERROR?

I am using IAR to compile routines, but run error on ARM A7; then i got the question below when i open the .lst file generated by IAR.
It is a ISR, first push {r3, r4, r5, lr}, but POP {r0, r4, r5, lr} when return, the R0 value is changed to the value of R3 before push. So R0 is wrong when returned from irqHandler which lead to error in follow routines.
why ?
void irqHandler(void)
{
878: e92d4038 push {r3, r4, r5, lr}
volatile u32 *pt = (u32 *)AM_INTC_BASE;
87c: e3a044b0 mov r4, #176, 8 ; 0xb0000000
u32 id_spin;
id_spin = *(pt+0x200c/4) & 0x3ff;
880: e302000c movw r0, #8204 ; 0x200c
884: e7900004 ldr r0, [r0, r4]
888: e1b00b00 lsls r0, r0, #22
88c: e1b00b20 lsrs r0, r0, #22
890: e1b05000 movs r5, r0
if(id_spin<32)
894: e3550020 cmp r5, #32
898: 2a000000 bcs 8a0 <irqHandler+0x28>
{
#ifdef WHOLECHIPSIM
print("id_spid<32 error...\r\n",0);
#endif
while(1);
89c: eafffffe b 89c <irqHandler+0x24>
}
else
{
(pFuncIrq[id_spin-32])();
8a0: e59f0010 ldr r0, [pc, #16] ; 8b8 <.text_8>
8a4: e1b01105 lsls r1, r5, #2
8a8: e0910000 adds r0, r1, r0
8ac: e5100080 ldr r0, [r0, #-128] ; 0x80
8b0: e12fff30 blx r0
}
}
8b4: e8bd8031 pop {r0, r4, r5, pc}

The abi requires a 64 bit aligned stack, so the push of r3 simply facilitates that. Could have chosen any register not already specified. Likewise on the pop they need to clean up the stack the function is prototyped as void so the return (r0) is a dont care and r0-r3 are not expected to be preserved so no reason to match the r3 on each end nor match an r0 on each end.
had they chose a register numbered above r3 (r6 for example) on the push then that would have needed to be matched on the pop. Otherwise the pop would have to be one of r0-r3 to not trash a non-volatile register. (couldnt push r3 then pop r6 that would trash r6)

It does not matter as R0-R3, R12, LR, PC, xPSR are saved on the stack automaticly when the hardware invokes the interrupt vector routine. When bx, ldm, pop, or ldr with PC is invoked hardware executes interrupt routine exit poping those registers.
Do not check your compiler. It knows what it does. Check tour wrong logic - especially printing strings in the interrupt handler.

assemble code with the keyword __irq __arm is below:
__irq __arm void irqHandler(void)
{
878: e24ee004 sub lr, lr, #4
87c: e92d503f push {r0, r1, r2, r3, r4, r5, ip, lr}
volatile u32 *pt = (u32 *)AM_INTC_BASE;
880: e3a044b0 mov r4, #176, 8 ; 0xb0000000
u32 id_spin;
id_spin = *(pt+0x200c/4) & 0x3ff;
884: e302000c movw r0, #8204 ; 0x200c
888: e7900004 ldr r0, [r0, r4]
88c: e1b00b00 lsls r0, r0, #22
890: e1b00b20 lsrs r0, r0, #22
894: e1b05000 movs r5, r0
if(id_spin<32)
898: e3550020 cmp r5, #32
89c: 2a000000 bcs 8a4 <irqHandler+0x2c>
{
#ifdef WHOLECHIPSIM
print("id_spid<32 error...\r\n",0);
#endif
while(1);
8a0: eafffffe b 8a0 <irqHandler+0x28>
}
else
{
(pFuncIrq[id_spin-32])();
8a4: e59f0010 ldr r0, [pc, #16] ; 8bc <.text_8>
8a8: e1b01105 lsls r1, r5, #2
8ac: e0910000 adds r0, r1, r0
8b0: e5100080 ldr r0, [r0, #-128] ; 0x80
8b4: e12fff30 blx r0
}
}
8b8: e8fd903f ldm sp!, {r0, r1, r2, r3, r4, r5, ip, pc}^

Cortex A7 PUSH log ,it just push 7 register, so 32bit aligned is ok
follow link is the log info:
http://img.blog.csdn.net/20170819120758443?watermark/2/text/aHR0cDovL2Jsb2cuY3Nkbi5uZXQvcmFpbmJvd2JpcmRzX2Flcw==/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70/gravity/Center

Why does direct accessing to structure members produces significantly more assembly code compared to indirect accessing in GCC?

There are many topics here whether a direct access or an indirect access (via pointer) are faster when accessing structure members, in C.
One example: C pointers vs direct member access for structs
The general opinion is direct access will be faster (at least theoretically) since pointer dereferencing is not used.
So I gave it a try with a chunk of code in my system: GNU Embedded Tools GCC 4.7.4, generating code for ARM (actually ARM-Cortex-A15).
Surprisingly, direct access was much slower. Then I generated the assembly codes for the object file.
Direct access code has 114 lines of assembly code, and indirect access code has 33 lines of assembly code. What is going on here?
Below are the C code and generated assembly code of the functions. The structures are all map to external memory and the structure members are all one-byte word long (unsigned char type).
First function, with indirect access:
void sub_func_1(unsigned int num_of, struct file_s *__restrict__ first_file_ptr, struct file_s *__restrict__ second_file_ptr, struct output_s *__restrict__ output_ptr)
{
if(LIKELY(num_of == 0))
{
output_ptr->curr_id = UNUSED;
output_ptr->curr_cnt = output_ptr->cnt;
output_ptr->curr_mode = output_ptr->_mode;
output_ptr->curr_type = output_ptr->type;
output_ptr->curr_size = output_ptr->size;
output_ptr->curr_allocation_type = output_ptr->allocation_type;
output_ptr->curr_allocation_localized = output_ptr->allocation_localized;
output_ptr->curr_mode_enable = output_ptr->mode_enable;
if(output_ptr->curr_cnt == 1)
{
first_file_ptr->status = BLOCK_IDLE;
first_file_ptr->type = USER_DATA_TYPE;
first_file_ptr->index = FIRST__WORD;
first_file_ptr->layer_cnt = output_ptr->layer_cnt;
second_file_ptr->status = DISABLED;
second_file_ptr->index = 0;
second_file_ptr->redundancy_version = 1;
output_ptr->total_layer_cnt = first_file_ptr->layer_cnt;
}
}
}
00000000 <sub_func_1>:
0: e3500000 cmp r0, #0
4: e92d01f0 push {r4, r5, r6, r7, r8}
8: 1a00001b bne 7c <sub_func_1+0x7c>
c: e5d34007 ldrb r4, [r3, #7]
10: e3a05008 mov r5, #8
14: e5d3c003 ldrb ip, [r3, #3]
18: e5d38014 ldrb r8, [r3, #20]
1c: e5c35001 strb r5, [r3, #1]
20: e5d37015 ldrb r7, [r3, #21]
24: e5d36018 ldrb r6, [r3, #24]
28: e5c34008 strb r4, [r3, #8]
2c: e5d35019 ldrb r5, [r3, #25]
30: e35c0001 cmp ip, #1
34: e5c3c005 strb ip, [r3, #5]
38: e5d34012 ldrb r4, [r3, #18]
3c: e5c38010 strb r8, [r3, #16]
40: e5c37011 strb r7, [r3, #17]
44: e5c3601a strb r6, [r3, #26]
48: e5c3501b strb r5, [r3, #27]
4c: e5c34013 strb r4, [r3, #19]
50: 1a000009 bne 7c <sub_func_1+0x7c>
54: e5d3400b ldrb r4, [r3, #11]
58: e3a05005 mov r5, #5
5c: e5c1c000 strb ip, [r1]
60: e5c10002 strb r0, [r1, #2]
64: e5c15001 strb r5, [r1, #1]
68: e5c20000 strb r0, [r2]
6c: e5c14003 strb r4, [r1, #3]
70: e5c20005 strb r0, [r2, #5]
74: e5c2c014 strb ip, [r2, #20]
78: e5c3400f strb r4, [r3, #15]
7c: e8bd01f0 pop {r4, r5, r6, r7, r8}
80: e12fff1e bx lr
Second function, with direct access:
void sub_func_2(unsigned int output_index, unsigned int cc_index, unsigned int num_of)
{
if(LIKELY(num_of == 0))
{
output_file[output_index].curr_id = UNUSED;
output_file[output_index].curr_cnt = output_file[output_index].cnt;
output_file[output_index].curr_mode = output_file[output_index]._mode;
output_file[output_index].curr_type = output_file[output_index].type;
output_file[output_index].curr_size = output_file[output_index].size;
output_file[output_index].curr_allocation_type = output_file[output_index].allocation_type;
output_file[output_index].curr_allocation_localized = output_file[output_index].allocation_localized;
output_file[output_index].curr_mode_enable = output_file[output_index].mode_enable;
if(output_file[output_index].curr_cnt == 1)
{
output_file[output_index].cc_file[cc_index].file[0].status = BLOCK_IDLE;
output_file[output_index].cc_file[cc_index].file[0].type = USER_DATA_TYPE;
output_file[output_index].cc_file[cc_index].file[0].index = FIRST__WORD;
output_file[output_index].cc_file[cc_index].file[0].layer_cnt = output_file[output_index].layer_cnt;
output_file[output_index].cc_file[cc_index].file[1].status = DISABLED;
output_file[output_index].cc_file[cc_index].file[1].index = 0;
output_file[output_index].cc_file[cc_index].file[1].redundancy_version = 1;
output_file[output_index].total_layer_cnt = output_file[output_index].cc_file[cc_index].file[0].layer_cnt;
}
}
}
00000084 <sub_func_2>:
84: e92d0ff0 push {r4, r5, r6, r7, r8, r9, sl, fp}
88: e3520000 cmp r2, #0
8c: e24dd018 sub sp, sp, #24
90: e58d2004 str r2, [sp, #4]
94: 1a000069 bne 240 <sub_func_2+0x1bc>
98: e3a03d61 mov r3, #6208 ; 0x1840
9c: e30dc0c0 movw ip, #53440 ; 0xd0c0
a0: e340c001 movt ip, #1
a4: e3002000 movw r2, #0
a8: e0010193 mul r1, r3, r1
ac: e3402000 movt r2, #0
b0: e3067490 movw r7, #25744 ; 0x6490
b4: e3068488 movw r8, #25736 ; 0x6488
b8: e3a0b008 mov fp, #8
bc: e3066498 movw r6, #25752 ; 0x6498
c0: e02c109c mla ip, ip, r0, r1
c4: e082c00c add ip, r2, ip
c8: e28c3b19 add r3, ip, #25600 ; 0x6400
cc: e08c4007 add r4, ip, r7
d0: e5d39083 ldrb r9, [r3, #131] ; 0x83
d4: e08c5006 add r5, ip, r6
d8: e5d3a087 ldrb sl, [r3, #135] ; 0x87
dc: e5c3b081 strb fp, [r3, #129] ; 0x81
e0: e5c39085 strb r9, [r3, #133] ; 0x85
e4: e2833080 add r3, r3, #128 ; 0x80
e8: e7cca008 strb sl, [ip, r8]
ec: e5d4a004 ldrb sl, [r4, #4]
f0: e7cca007 strb sl, [ip, r7]
f4: e5d47005 ldrb r7, [r4, #5]
f8: e5c47001 strb r7, [r4, #1]
fc: e7dc6006 ldrb r6, [ip, r6]
100: e5d5c001 ldrb ip, [r5, #1]
104: e5c56002 strb r6, [r5, #2]
108: e5c5c003 strb ip, [r5, #3]
10c: e5d4c002 ldrb ip, [r4, #2]
110: e5c4c003 strb ip, [r4, #3]
114: e5d33005 ldrb r3, [r3, #5]
118: e3530001 cmp r3, #1
11c: 1a000047 bne 240 <sub_func_2+0x1bc>
120: e30dc0c0 movw ip, #53440 ; 0xd0c0
124: e30db0c0 movw fp, #53440 ; 0xd0c0
128: e1a0700c mov r7, ip
12c: e7dfc813 bfi ip, r3, #16, #16
130: e1a05007 mov r5, r7
134: e1a0900b mov r9, fp
138: e02c109c mla ip, ip, r0, r1
13c: e1a04005 mov r4, r5
140: e1a0a00b mov sl, fp
144: e7df9813 bfi r9, r3, #16, #16
148: e7dfb813 bfi fp, r3, #16, #16
14c: e1a06007 mov r6, r7
150: e7dfa813 bfi sl, r3, #16, #16
154: e58dc008 str ip, [sp, #8]
158: e7df6813 bfi r6, r3, #16, #16
15c: e1a0c004 mov ip, r4
160: e7df4813 bfi r4, r3, #16, #16
164: e02b109b mla fp, fp, r0, r1
168: e7df5813 bfi r5, r3, #16, #16
16c: e0291099 mla r9, r9, r0, r1
170: e7df7813 bfi r7, r3, #16, #16
174: e7dfc813 bfi ip, r3, #16, #16
178: e0261096 mla r6, r6, r0, r1
17c: e0241094 mla r4, r4, r0, r1
180: e082b00b add fp, r2, fp
184: e0829009 add r9, r2, r9
188: e02a109a mla sl, sl, r0, r1
18c: e28bbc65 add fp, fp, #25856 ; 0x6500
190: e58d600c str r6, [sp, #12]
194: e2899c65 add r9, r9, #25856 ; 0x6500
198: e3a06005 mov r6, #5
19c: e58d4010 str r4, [sp, #16]
1a0: e59d4008 ldr r4, [sp, #8]
1a4: e0251095 mla r5, r5, r0, r1
1a8: e5cb3000 strb r3, [fp]
1ac: e082a00a add sl, r2, sl
1b0: e59db00c ldr fp, [sp, #12]
1b4: e5c96001 strb r6, [r9, #1]
1b8: e59d6004 ldr r6, [sp, #4]
1bc: e28aac65 add sl, sl, #25856 ; 0x6500
1c0: e58d5014 str r5, [sp, #20]
1c4: e0271097 mla r7, r7, r0, r1
1c8: e0825004 add r5, r2, r4
1cc: e30d40c0 movw r4, #53440 ; 0xd0c0
1d0: e02c109c mla ip, ip, r0, r1
1d4: e0855008 add r5, r5, r8
1d8: e7df4813 bfi r4, r3, #16, #16
1dc: e5ca6002 strb r6, [sl, #2]
1e0: e5d59003 ldrb r9, [r5, #3]
1e4: e082600b add r6, r2, fp
1e8: e59db014 ldr fp, [sp, #20]
1ec: e0201094 mla r0, r4, r0, r1
1f0: e2866c65 add r6, r6, #25856 ; 0x6500
1f4: e59d1010 ldr r1, [sp, #16]
1f8: e306a53c movw sl, #25916 ; 0x653c
1fc: e0827007 add r7, r2, r7
200: e2877c65 add r7, r7, #25856 ; 0x6500
204: e082c00c add ip, r2, ip
208: e5c69003 strb r9, [r6, #3]
20c: e59d6004 ldr r6, [sp, #4]
210: e28ccc65 add ip, ip, #25856 ; 0x6500
214: e082500b add r5, r2, fp
218: e0820000 add r0, r2, r0
21c: e0824001 add r4, r2, r1
220: e085500a add r5, r5, sl
224: e0808008 add r8, r0, r8
228: e7c4600a strb r6, [r4, sl]
22c: e5c56005 strb r6, [r5, #5]
230: e5c73050 strb r3, [r7, #80] ; 0x50
234: e5dc3003 ldrb r3, [ip, #3]
238: e287704c add r7, r7, #76 ; 0x4c
23c: e5c83007 strb r3, [r8, #7]
240: e28dd018 add sp, sp, #24
244: e8bd0ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp}
248: e12fff1e bx lr
And last part, my compile options are:
# Compile options.
C_OPTS = -Wall \
-std=gnu99 \
-fgnu89-inline \
-Wcast-align \
-Werror=uninitialized \
-Werror=maybe-uninitialized \
-Werror=overflow \
-mcpu=cortex-a15 \
-mtune=cortex-a15 \
-mabi=aapcs \
-mfpu=neon \
-ftree-vectorize \
-ftree-slp-vectorize \
-ftree-vectorizer-verbose=4 \
-mfloat-abi=hard \
-O3 \
-flto \
-marm \
-ffat-lto-objects \
-fno-gcse \
-fno-strict-aliasing \
-fno-delete-null-pointer-checks \
-fno-strict-overflow \
-fuse-linker-plugin \
-falign-functions=4 \
-falign-loops=4 \
-falign-labels=4 \
-falign-jumps=4
Update:
Note: I deleted the structure definitions because there was differences with the version of my own program. It is actually a huge structure and it is not efficient to put here completely.
As suggested, I get rid of -fno-gcse, and the generated asm is not huge as before.
Without -fno-gcse, sub_func_1 generates the same code as above.
For sub_func_2:
00000084 <sub_func_2>:
84: e3520000 cmp r2, #0
88: e92d0070 push {r4, r5, r6}
8c: 1a000030 bne 154 <sub_func_2+0xd0>
90: e30d30c0 movw r3, #53440 ; 0xd0c0
94: e3a06008 mov r6, #8
98: e3403001 movt r3, #1
9c: e0030093 mul r3, r3, r0
a0: e3a00d61 mov r0, #6208 ; 0x1840
a4: e0213190 mla r1, r0, r1, r3
a8: e59f30ac ldr r3, [pc, #172] ; 15c <sub_func_2+0xd8>
ac: e0831001 add r1, r3, r1
b0: e2813b19 add r3, r1, #25600 ; 0x6400
b4: e5d34083 ldrb r4, [r3, #131] ; 0x83
b8: e1a00003 mov r0, r3
bc: e5d35087 ldrb r5, [r3, #135] ; 0x87
c0: e5c36081 strb r6, [r3, #129] ; 0x81
c4: e5c34085 strb r4, [r3, #133] ; 0x85
c8: e3064488 movw r4, #25736 ; 0x6488
cc: e2833080 add r3, r3, #128 ; 0x80
d0: e7c15004 strb r5, [r1, r4]
d4: e5d05094 ldrb r5, [r0, #148] ; 0x94
d8: e0844006 add r4, r4, r6
dc: e7c15004 strb r5, [r1, r4]
e0: e5d04095 ldrb r4, [r0, #149] ; 0x95
e4: e5d0c092 ldrb ip, [r0, #146] ; 0x92
e8: e5c04091 strb r4, [r0, #145] ; 0x91
ec: e3064498 movw r4, #25752 ; 0x6498
f0: e7d15004 ldrb r5, [r1, r4]
f4: e5c0c093 strb ip, [r0, #147] ; 0x93
f8: e5d04099 ldrb r4, [r0, #153] ; 0x99
fc: e5c0509a strb r5, [r0, #154] ; 0x9a
100: e5c0409b strb r4, [r0, #155] ; 0x9b
104: e5d33005 ldrb r3, [r3, #5]
108: e3530001 cmp r3, #1
10c: 1a000010 bne 154 <sub_func_2+0xd0>
110: e281cc65 add ip, r1, #25856 ; 0x6500
114: e3a06005 mov r6, #5
118: e2810b19 add r0, r1, #25600 ; 0x6400
11c: e1a0500c mov r5, ip
120: e5cc3000 strb r3, [ip]
124: e1a0400c mov r4, ip
128: e5cc6001 strb r6, [ip, #1]
12c: e5cc2002 strb r2, [ip, #2]
130: e5d0608b ldrb r6, [r0, #139] ; 0x8b
134: e5cc6003 strb r6, [ip, #3]
138: e306c53c movw ip, #25916 ; 0x653c
13c: e7c1200c strb r2, [r1, ip]
140: e5c52041 strb r2, [r5, #65] ; 0x41
144: e285503c add r5, r5, #60 ; 0x3c
148: e5c43050 strb r3, [r4, #80] ; 0x50
14c: e284404c add r4, r4, #76 ; 0x4c
150: e5c0608f strb r6, [r0, #143] ; 0x8f
154: e8bd0070 pop {r4, r5, r6}
158: e12fff1e bx lr
15c: 00000000 .word 0x00000000

TL:DR: can't reproduce that insane compiler output. Maybe the surrounding code + LTO did it?
I do have suggestions to improve the code: see the stuff below about copying whole structs instead of copying many individual members.
The question you linked is about accessing a value-type global directly vs. through a global pointer. On ARM, where it takes multiple instructions or a load from a nearby constant to get an arbitrary 32bit pointer into a register, passing around pointers is better than having each function reference a global directly.
See this example on the Godbolt Compiler Explorer (ARM gcc 4.8.2 -O3)
struct example {
int a, b, c;
} global_example;
int load_global(void) { return global_example.c; }
movw r3, #:lower16:global_example # tmp113,
movt r3, #:upper16:global_example # tmp113,
ldr r0, [r3, #8] #, global_example.c
bx lr #
int load_pointer(struct example *p) { return p->c; }
ldr r0, [r0, #8] #, p_2(D)->c
bx lr #
(Apparently gcc is horrible at passing structs by val as function args, see the code for byval(struct example by_val) on the godbolt link.)
Even worse is if you have a global pointer: first you have to load the value of the pointer, then another load to dereference it. This is the indirection overhead that was being discussed in the question you linked. If both loads miss in cache, you're paying the round-trip latency twice. The load address for the 2nd load isn't available until the first load completes, so no pipelining of those memory requests is possible even on an out-of-order CPU.
If you already have a pointer as an arg, it will be in a register. Dereferencing it is the same as loading from a global. (But better, because you don't need to get the global's address into a register yourself.)
Your real code
I can't reproduce your massive asm output with ARM gcc 4.8.2 on Godbolt, or locally with ARM gcc 5.2.1. I'm not using LTO, though, since I don't have a complete test program.
All I can see is just slightly larger code to do some index math.
bfi is Bitfield Insert. I think 144: e7df9813 bfi r9, r3, #16, #16 is setting the top half of r9 = low half of r3. I don't see how that and mla (integer mul-accumulate) make much sense. Other than perverse results from -ftree-vectorize, all I can think of is maybe -fno-gcse has a really bad impact for the version of gcc you tested.
Is it manipulating constants that are going to be stored? The code you actually posted #defines everything to 0, which gcc takes advantage of. (It also takes advantage of the fact that it already has 1 in a register if curr_cnt == 1, and stores that register for the second_file_ptr->redundancy_version = 1;). ARM doesn't have a str [mem], immediate or anything like x86's mov [mem], imm.
If your compiler output is from code with different values for those constants, the compiler would be doing more work to store different things.
Unfortunately gcc is bad at merging narrow stores into a single wider store (long-standing missed-optimization bug). For x86, clang does this in at least one case, storing 0x0100 (256) instead of a 0 and a 1. (check on godbolt by flipping the compiler to clang 3.7.1 or something, and removing the ARM-specific compiler args. There's a mov word ptr \[rsi\], 256 where gcc uses
mov BYTE PTR [rsi], 0 # *first_file_ptr_23(D).status,
mov BYTE PTR [rsi+1], 1 # *first_file_ptr_23(D).type,
If you arranged your structs carefully, there would be more opportunities for copying 4B blocks in this function.
It might also help to have two identical sub-structs of curr and not-curr, instead of curr_size and size. You might have to declare it packed to avoid padding after the sub-structs, though. Your two groups of members aren't in exactly the same order, which prevents compilers from doing much block-copying anyway when you do a bunch of assignments.
It helps gcc and clang copy multiple bytes at once if you do:
struct output_s_optimized {
struct __attribute__((packed)) stuff {
unsigned char cnt,
mode,
type,
size,
allocation_type,
allocation_localized,
mode_enable;
} curr; // 7B
unsigned char curr_id; // no non-curr id?
struct stuff non_curr;
unsigned char layer_cnt;
// Another 8 byte boundary here
unsigned char total_layer_cnt;
struct cc_file_s cc_file[128];
};
void foo(struct output_s_optimized *p) {
p->curr_id = 0;
p->non_curr = p->curr;
}
void bar(struct output_s_optimized *output_ptr) {
output_ptr->curr_id = 0;
output_ptr->curr.cnt = output_ptr->non_curr.cnt;
output_ptr->curr.mode = output_ptr->non_curr.mode;
output_ptr->curr.type = output_ptr->non_curr.type;
output_ptr->curr.size = output_ptr->non_curr.size;
output_ptr->curr.allocation_type = output_ptr->non_curr.allocation_type;
output_ptr->curr.allocation_localized = output_ptr->non_curr.allocation_localized;
output_ptr->curr.mode_enable = output_ptr->non_curr.mode_enable;
}
gcc 4.8.2 compiles foo() to three copies: byte, 2B, and 4B, even on ARM. It compiles bar() to eight 1B copies, and so does clang-3.8 on x86. So copying whole structs can help your compiler a lot (as well as making sure the data to be copied is arranged in the same order in both locations).
the same code on x86: nothing new
You can use -fverbose-asm to put comments on each line. For x86, the compiler output from gcc 6.1 -O3 is very similar between versions, as you can see on the Godbolt Compiler Explorer. x86 addressing modes can index a global variable directly, so you see stuff like
movzx edi, BYTE PTR [rcx+10] # *output_ptr_7(D)._mode
# where rcx is the output_ptr arg, used directly
vs.
movzx ecx, BYTE PTR output_file[rdi+10] # output_file[output_index_7(D)]._mode
# where rdi = output_index * 1297 (sizeof(output_file[0])), calculated once at the start
(gcc apparently doesn't care that each instruction has a 4B displacement as part of the addressing mode, but this is an ARM question so I won't go tradeoffs between code-size and insn count with x86's variable-length insns.)

In broad (architecture-agnostic) terms, this is what your instructions do:
global_structure_pointer->field = value;
loads the value of global_structure_pointer into an addressing register.
adds the offset represented by field to the addressing register.
stores value into the memory location addressed by the addressing register.
global_structure[index].field = value;
loads the address of global_structure into an addressing register.
loads the value of index into an arithmetic register.
multiplies the arithmetic register by the size of global_structure.
adds the arithmetic register to the addressing register.
stores value into the memory location addressed by the addressing register.
Your confusion seems to be due to a misunderstanding of what "direct access" actually is.
THIS is direct access:
global_structure.field = value;
What you thought of as direct access is in fact indexed access.

ARM Deliberately Bloating Compiled Code?

While working on the issue in Fastest Cortex M0+ Thumb 32x32=64 multiplication function? I wrote the following C function to see how it would compile:
uint64_t lmul(uint32_t a, uint32_t b){
uint32_t hia = a >> 16,
hib = b >> 16,
loa = (uint32_t)(uint16_t)a,
lob = (uint32_t)(uint16_t)b,
low = loa * lob,
mid1 = hia * lob,
mid2 = loa * hib,
mid = mid1 + mid2,
high = hia * hib;
if (mid < mid1)
high += 0x10000;
return ((uint64_t)high << 32) + ((uint64_t)mid << 16) + low;
}
After compiling it with the ARM GCC compiler 4.7.3 through CodeWarrior (what came with the Freescale dev board I'm using) with size optimization, it turned into this:
00000eac <lmul>:
eac: b570 push {r4, r5, r6, lr}
eae: 0c06 lsrs r6, r0, #16
eb0: b280 uxth r0, r0
eb2: 0c0a lsrs r2, r1, #16
eb4: 1c04 adds r4, r0, #0
eb6: b289 uxth r1, r1
eb8: 434c muls r4, r1
eba: 4350 muls r0, r2
ebc: 4371 muls r1, r6
ebe: 1843 adds r3, r0, r1
ec0: 4356 muls r6, r2
ec2: 428b cmp r3, r1
ec4: d202 bcs.n ecc <lmul+0x20>
ec6: 2580 movs r5, #128 ; 0x80
ec8: 026a lsls r2, r5, #9
eca: 18b6 adds r6, r6, r2
ecc: 0c19 lsrs r1, r3, #16
ece: 0418 lsls r0, r3, #16
ed0: 1c22 adds r2, r4, #0
ed2: 2300 movs r3, #0
ed4: 1c04 adds r4, r0, #0
ed6: 1c0d adds r5, r1, #0
ed8: 18a4 adds r4, r4, r2
eda: 415d adcs r5, r3
edc: 1c31 adds r1, r6, #0
ede: 1c18 adds r0, r3, #0
ee0: 1c22 adds r2, r4, #0
ee2: 1c2b adds r3, r5, #0
ee4: 1812 adds r2, r2, r0
ee6: 414b adcs r3, r1
ee8: 1c10 adds r0, r2, #0
eea: 1c19 adds r1, r3, #0
eec: bd70 pop {r4, r5, r6, pc}
I cannot fathom what the compiler is doing in the last 40% of the function. It's like it's playing musical registers for no other purpose than to increase the size of the function. Is this something ARM is known to do, or is there some strange purpose to this that I lack the ARM assembly expertise to comprehend?
If I didn't make any mistakes in substitution the last half of the function could be represented by:
ecc: 0c19 lsrs r1, r3, #16
ece: 0418 lsls r0, r3, #16
ed2: 2300 movs r3, #0
ed8: 18a4 adds r0, r0, r4
eda: 415d adcs r1, r3
ee6: 414b adds r1, r1, r6
eec: bd70 pop {r4, r5, r6, pc}

I haven't used the CodeWarrior tool chain, but I decided to try this with uVision using the ARMCC compiler v 5.03.0.76. Optimizing for space is the default option (-Ospace) and the generated code was still pretty ugly... not too different from yours. When I compiled with the -O2 it looked more like what you would expect:
0x0000008A B570 PUSH {r4-r6,lr}
0x0000008C 0C02 LSRS r2,r0,#16
0x0000008E 0C0C LSRS r4,r1,#16
0x00000090 B280 UXTH r0,r0
0x00000092 B289 UXTH r1,r1
0x00000094 4606 MOV r6,r0
0x00000096 4615 MOV r5,r2
0x00000098 434D MULS r5,r1,r5
0x0000009A 4360 MULS r0,r4,r0
0x0000009C 434E MULS r6,r1,r6
0x0000009E 182B ADDS r3,r5,r0
0x000000A0 4362 MULS r2,r4,r2
0x000000A2 42AB CMP r3,r5
0x000000A4 D202 BCS 0x000000AC
0x000000A6 2001 MOVS r0,#0x01
0x000000A8 0400 LSLS r0,r0,#16
0x000000AA 1812 ADDS r2,r2,r0
0x000000AC 2400 MOVS r4,#0x00
0x000000AE 0C19 LSRS r1,r3,#16
0x000000B0 0418 LSLS r0,r3,#16
0x000000B2 1900 ADDS r0,r0,r4
0x000000B4 4151 ADCS r1,r1,r2
0x000000B6 1980 ADDS r0,r0,r6
0x000000B8 4161 ADCS r1,r1,r4
0x000000BA BD70 POP {r4-r6,pc}
You can try compiling with different optimization options but I would suggest that you go with a newer compiler as Marc Glisse states in his comment.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

C - occasional CPU stall during memcmp on Cortex-R5 - c

Related

ARM64 Backtrace from link register

Self written simple memset not working with -03 eabi gcc on ARMv7

Why the interrupt service routine ,PUSH {r3,r4,r5,lr} but POP {r0,r4,r5,lr},which lead to ERROR?

Why does direct accessing to structure members produces significantly more assembly code compared to indirect accessing in GCC?

ARM Deliberately Bloating Compiled Code?

Categories

Resources