arm-none-eabi-gcc C Pointers - c

So working with C in the arm-none-eabi-gcc. I have been having an issue with pointers, they don't seem to exists. Perhaps I'm passing the wrong cmds to the compiler.
Here is an example.
unsigned int * gpuPointer = GetGPU_Pointer(framebufferAddress);
unsigned int color = 16;
int y = 768;
int x = 1024;
while(y >= 0)
{
while(x >= 0)
{
*gpuPointer = color;
color = color + 2;
x--;
}
color++;
y--;
x = 1024;
}
and the output from the disassembler.
81c8: ebffffc3 bl 80dc <GetGPU_Pointer>
81cc: e3a0c010 mov ip, #16 ; 0x10
81d0: e28c3b02 add r3, ip, #2048 ; 0x800
81d4: e2833002 add r3, r3, #2 ; 0x2
81d8: e1a03803 lsl r3, r3, #16
81dc: e1a01823 lsr r1, r3, #16
81e0: e1a0300c mov r3, ip
81e4: e1a02003 mov r2, r3
81e8: e2833002 add r3, r3, #2 ; 0x2
81ec: e1a03803 lsl r3, r3, #16
81f0: e1a03823 lsr r3, r3, #16
81f4: e1530001 cmp r3, r1
81f8: 1afffff9 bne 81e4 <setup_framebuffer+0x5c>
Shouldn't there be a str cmd around 81e4? To add further the GetGPU_Pointer is coming from an assembler file but there is a declaration as so.
extern unsigned int * GetGPU_Pointer(unsigned int framebufferAddress);
My gut feeling is its something absurdly simple but I'm missing it.

You never change the value of gpuPointer and you haven't declared it to point to a volatile. So from the compiler's perspective you are overwriting a single memory location (*gpuPointer) 768*1024 times, but since you never use the value you are writing into it, the compiler is entitled to optimize by doing a single write at the end of the loop.

Adding to rici's answer (upvote rici not me)...
It gets even better, taking what you offered and wrapping it
extern unsigned int * GetGPU_Pointer ( unsigned int );
void fun ( unsigned int framebufferAddress )
{
unsigned int * gpuPointer = GetGPU_Pointer(framebufferAddress);
unsigned int color = 16;
int y = 768;
int x = 1024;
while(y >= 0)
{
while(x >= 0)
{
*gpuPointer = color;
color = color + 2;
x--;
}
color++;
y--;
x = 1024;
}
}
Optimizes to
00000000 <fun>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <GetGPU_Pointer>
8: e59f3008 ldr r3, [pc, #8] ; 18 <fun+0x18>
c: e5803000 str r3, [r0]
10: e8bd4008 pop {r3, lr}
14: e12fff1e bx lr
18: 00181110 andseq r1, r8, r0, lsl r1
because the code really doesnt do anything but that one store.
Now if you were to modify the pointer
while(x >= 0)
{
*gpuPointer = color;
gpuPointer++;
color = color + 2;
x--;
}
then you get the store you were looking for
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <GetGPU_Pointer>
8: e59f403c ldr r4, [pc, #60] ; 4c <fun+0x4c>
c: e1a02000 mov r2, r0
10: e3a0c010 mov ip, #16
14: e2820a01 add r0, r2, #4096 ; 0x1000
18: e2801004 add r1, r0, #4
1c: e1a0300c mov r3, ip
20: e4823004 str r3, [r2], #4
24: e1520001 cmp r2, r1
28: e2833002 add r3, r3, #2
2c: 1afffffb bne 20 <fun+0x20>
30: e28ccb02 add ip, ip, #2048 ; 0x800
34: e28cc003 add ip, ip, #3
38: e15c0004 cmp ip, r4
3c: e2802004 add r2, r0, #4
40: 1afffff3 bne 14 <fun+0x14>
44: e8bd4010 pop {r4, lr}
48: e12fff1e bx lr
4c: 00181113 andseq r1, r8, r3, lsl r1
or if you make it volatile (and then dont have to modify it)
volatile unsigned int * gpuPointer = GetGPU_Pointer(framebufferAddress);
then
00000000 <fun>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <GetGPU_Pointer>
8: e59fc02c ldr ip, [pc, #44] ; 3c <fun+0x3c>
c: e3a03010 mov r3, #16
10: e2831b02 add r1, r3, #2048 ; 0x800
14: e2812002 add r2, r1, #2
18: e5803000 str r3, [r0]
1c: e2833002 add r3, r3, #2
20: e1530002 cmp r3, r2
24: 1afffffb bne 18 <fun+0x18>
28: e2813003 add r3, r1, #3
2c: e153000c cmp r3, ip
30: 1afffff6 bne 10 <fun+0x10>
34: e8bd4008 pop {r3, lr}
38: e12fff1e bx lr
3c: 00181113 andseq r1, r8, r3, lsl r1
then you get your store
arm-none-eabi-gcc -O2 -c a.c -o a.o
arm-none-eabi-objdump -D a.o
arm-none-eabi-gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The problem is, as written, you didnt tell the compiler to update the pointer more than the one time. So as in my first example it has no reason to even implement the loop, it can pre-compute the answer and write it one time. In order to force the compiler to implement the loop and write to the pointer more than one time, you either need to make it volatile and/or modify it, depends on what you were really needing to do.

Related

how to force arm gcc compiler to not to use 32bit access for an unaligned memory

I work on a memory which cannot handle 32bit access on an unaligned address. For unaligned addresses the memory supports 8bit level access.
In my code there is a memcpy, when I pass a unaligned address to memcpy the chip was getting stuck.
Upon looking deeper I figured out the generated assembly code of memcpy is doing a 32bit access to the address regardless of whether the given address is aligned to 32bit or not. When I reduced the optimization level to O(2) then the compiler generates code which always do a 8bit access.
[Edit] : Below is the memcpy code which I am using
void* memcpy(void * restrict s1, const void * restrict s2, size_t n)
{
char* ll = (char*)s1;
char* rr = (char*)s2;
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
return s1;
}
Below is the disassembly of the code
void* memcpy3(void *s1, void *s2, size_t n)
{
char* ll = (char*)s1;
char* rr = (char*)s2;
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
0: b38a cbz r2, 66 <memcpy3+0x66>
{
2: b4f0 push {r4, r5, r6, r7}
4: 1d03 adds r3, r0, #4
6: 1d0c adds r4, r1, #4
8: 42a0 cmp r0, r4
a: bf38 it cc
c: 4299 cmpcc r1, r3
e: d31e bcc.n 4e <memcpy3+0x4e>
10: 2a08 cmp r2, #8
12: d91c bls.n 4e <memcpy3+0x4e>
14: 460d mov r5, r1
16: 4604 mov r4, r0
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
18: 2300 movs r3, #0
1a: 0897 lsrs r7, r2, #2
1c: f855 6b04 ldr.w r6, [r5], #4
20: 3301 adds r3, #1
22: 429f cmp r7, r3
24: f844 6b04 str.w r6, [r4], #4
28: d8f8 bhi.n 1c <memcpy3+0x1c>
2a: f022 0303 bic.w r3, r2, #3
2e: 429a cmp r2, r3
30: d00b beq.n 4a <memcpy3+0x4a>
32: 56cd ldrsb r5, [r1, r3]
34: 1c5c adds r4, r3, #1
36: 42a2 cmp r2, r4
38: 54c5 strb r5, [r0, r3]
3a: d906 bls.n 4a <memcpy3+0x4a>
3c: 570d ldrsb r5, [r1, r4]
3e: 3302 adds r3, #2
40: 429a cmp r2, r3
42: 5505 strb r5, [r0, r4]
44: d901 bls.n 4a <memcpy3+0x4a>
46: 56ca ldrsb r2, [r1, r3]
48: 54c2 strb r2, [r0, r3]
return s1;
}
4a: bcf0 pop {r4, r5, r6, r7}
4c: 4770 bx lr
4e: 3a01 subs r2, #1
50: 440a add r2, r1
52: 1e43 subs r3, r0, #1
54: 3901 subs r1, #1
for (size_t i = 0; i < n; i++) ll[i] = rr[i];
56: f911 4f01 ldrsb.w r4, [r1, #1]!
5a: 4291 cmp r1, r2
5c: f803 4f01 strb.w r4, [r3, #1]!
60: d1f9 bne.n 56 <memcpy3+0x56>
}
62: bcf0 pop {r4, r5, r6, r7}
64: 4770 bx lr
66: 4770 bx lr
Is it possible to configure the arm-gcc compiler to not to use a 32bit access on an unaligned address.
Use -mno-unaligned-access flag to tell the compiler to not to use unaligned access. By default the compiler uses -munaligned-access.
Use the "-mcpu=" flag to set the processor type, not "-march=", as it covers more of the options.
The processor determines whether unaligned accesses are allowed to the bus interface. But an unaligned access will be translated into smaller parts before the access to the memory device. It really doesn't make sense to say that a memory does not support unaligned accesses, as it never sees them regardless of what the core does.

Example of a static vs automatic variable in assembly

In C, we can use the following two examples to show the difference between a static and non-static variable:
for (int i = 0; i < 5; ++i) {
static int n = 0;
printf("%d ", ++n); // prints 1 2 3 4 5 - the value persists
}
And:
for (int i = 0; i < 5; ++i) {
int n = 0;
printf("%d ", ++n); // prints 1 1 1 1 1 - the previous value is lost
}
Source: this answer.
What would be the most basic example in assembly to show the difference between how a static or non-static variable is created? (Or does this concept not exist in assembly?)
To implement a static object in assembly, you define it in a data section (of which there are various types, involving options for initialization and modification).
To implement an automatic object in assembly, you include space for it in the stack frame of a routine.
Examples, not necessarily syntactically correct in a particular assembly language, might be:
.data
foo: .word 34 // Static object named "foo".
.text
…
lr r3, foo // Load value of foo.
and:
.text
bar: // Start of routine named "bar".
foo = 16 // Define a symbol for convenience.
add sp, sp, -CalculatedSize // Allocate stack space for local data.
…
li r3, 34 // Load immediate value into register.
sr r3, foo(sp) // Store value into space reserved for foo on stack.
…
add sp, sp, +CalculatedSize // Automatic objects are released here.
ret
These are very simplified examples (as requested). Many modern schemes for using the hardware stack include frame pointers, which are not included above.
In the second example, CalculatedSize represents some amount that includes space for registers to be saved, space for the foo object, space for arguments for subroutine calls, and whatever other stack space is needed by the routine. The offset of 16 provided for foo is part of those calculations; the author of the routine would arrange their stack frame largely as they desire.
Just try it
void more_fun ( int );
void fun0 ( void )
{
for (int i = 0; i < 500; ++i) {
static int n = 0;
more_fun(++n);
}
}
void fun1 ( void )
{
for (int i = 0; i < 500; ++i) {
int n = 0;
more_fun( ++n);
}
}
Disassembly of section .text:
00000000 <fun0>:
0: e92d4070 push {r4, r5, r6, lr}
4: e3a04f7d mov r4, #500 ; 0x1f4
8: e59f501c ldr r5, [pc, #28] ; 2c <fun0+0x2c>
c: e5953000 ldr r3, [r5]
10: e2833001 add r3, r3, #1
14: e1a00003 mov r0, r3
18: e5853000 str r3, [r5]
1c: ebfffffe bl 0 <more_fun>
20: e2544001 subs r4, r4, #1
24: 1afffff8 bne c <fun0+0xc>
28: e8bd8070 pop {r4, r5, r6, pc}
2c: 00000000
00000030 <fun1>:
30: e92d4010 push {r4, lr}
34: e3a04f7d mov r4, #500 ; 0x1f4
38: e3a00001 mov r0, #1
3c: ebfffffe bl 0 <more_fun>
40: e2544001 subs r4, r4, #1
44: 1afffffb bne 38 <fun1+0x8>
48: e8bd8010 pop {r4, pc}
Disassembly of section .bss:
00000000 <n.4158>:
0: 00000000 andeq r0, r0, r0
I like to think of static locals as local globals. They sit in .bss or .data just like globals. But from a C perspective they can only be accessed within the function/context that they were created in.
I local variable has no need for long term storage, so it is "created" and destroyed within that fuction. If we were to not optimize you would see that
some stack space is allocated.
00000064 <fun1>:
64: e92d4800 push {fp, lr}
68: e28db004 add fp, sp, #4
6c: e24dd008 sub sp, sp, #8
70: e3a03000 mov r3, #0
74: e50b300c str r3, [fp, #-12]
78: ea000009 b a4 <fun1+0x40>
7c: e3a03000 mov r3, #0
80: e50b3008 str r3, [fp, #-8]
84: e51b3008 ldr r3, [fp, #-8]
88: e2833001 add r3, r3, #1
8c: e50b3008 str r3, [fp, #-8]
90: e51b0008 ldr r0, [fp, #-8]
94: ebfffffe bl 0 <more_fun>
98: e51b300c ldr r3, [fp, #-12]
9c: e2833001 add r3, r3, #1
a0: e50b300c str r3, [fp, #-12]
a4: e51b300c ldr r3, [fp, #-12]
a8: e3530f7d cmp r3, #500 ; 0x1f4
ac: bafffff2 blt 7c <fun1+0x18>
b0: e1a00000 nop ; (mov r0, r0)
b4: e24bd004 sub sp, fp, #4
b8: e8bd8800 pop {fp, pc}
But optimized for fun1 the local variable is kept in a register, faster than keeping on the stack, in this solution they save the upstream value held in r4 so
that r4 can be used to hold n within this function, when the function returns there is no more need for n per the rules of the language.
For the static local, per the rules of the language that value remains static
outside the function and can be accessed within. Because it is initialized to 0 it lives in .bss not .data (gcc, and many others). In the code above the linker
will fill this value
2c: 00000000
in with the address to this
00000000 <n.4158>:
0: 00000000 andeq r0, r0, r0
IMO one could argue the implementation didnt need to treat it like a volatile and
sample and save n every loop. Could have basically implemented it like the second
function, but sampled up front from memory and saved it in the end. Either way
you can see the difference in an implementation of the high level code. The non-static local only lives within the function and then its storage anc contents
are essentially gone.
When modifying a variable, the static local variable modified by static is executed only once, and the life cycle of the local variable is extended until the program is run.
If you don't add static, each loop reallocates the value.

Self written simple memset not working with -03 eabi gcc on ARMv7

I wrote a very simple memset in c that works fine up to -O2 but not with -O3...
memset:
void * memset(void * blk, int c, size_t n)
{
unsigned char * dst = blk;
while (n-- > 0)
*dst++ = (unsigned char)c;
return blk;
}
...which compiles to this assembly when using -O2:
20000430 <memset>:
20000430: e3520000 cmp r2, #0 # compare param 'n' with zero
20000434: 012fff1e bxeq lr # if equal return to caller
20000438: e6ef1071 uxtb r1, r1 # else zero extend (extract byte from) param 'c'
2000043c: e0802002 add r2, r0, r2 # add pointer 'blk' to 'n'
20000440: e1a03000 mov r3, r0 # move pointer 'blk' to r3
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
20000450: e12fff1e bx lr # else back to caller
This makes sense to me. I annotated what happens here.
When I compile it with -O3 the program crashes. My memset calls itself repeatedly until it ate the whole stack:
200005e4 <memset>:
200005e4: e3520000 cmp r2, #0 # compare param 'n' with zero
200005e8: e92d4010 push {r4, lr} # ? (1)
200005ec: e1a04000 mov r4, r0 # move pointer 'blk' to r4 (temp to hold return value)
200005f0: 0a000001 beq 200005fc <memset+0x18> # if equal (first line compare) jump to epilogue
200005f4: e6ef1071 uxtb r1, r1 # zero extend (extract byte from) param 'c'
200005f8: ebfffff9 bl 200005e4 <memset> # call myself ? (2)
200005fc: e1a00004 mov r0, r4 # epilogue start. move return value to r0
20000600: e8bd8010 pop {r4, pc} # restore r4 and back to caller
I can't figure out how this optimised version is supposed to work without any strb or similar. It doesn't matter if I try to set the memory to '0' or something else so the function is not only called on .bss (zero initialised) variables.
(1) This is a problem. This push gets endlessly repeated without a matching pop as it's called by (2) when the function doesn't early-exit because of 'n' being zero. I verified this with uart prints. Also r2 is never touched so why should the compare to zero ever become true?
Please help me understand what's happening here. Is the compiler assuming prerequisites that I may not fulfill?
Background: I'm using external code that requires memset in my baremetal project so I rolled my own. It's only used once on startup and not performance critical.
/edit: The compiler is called with these options:
arm-none-eabi-gcc -O3 -Wall -Wextra -fPIC -nostdlib -nostartfiles -marm -fstrict-volatile-bitfields -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv3
Your first question (1). That is per the calling convention if you are going to make a nested function call you need to preserve the link register, and you need to be 64 bit aligned. The code uses r4 so that is the extra register saved. No magic there.
Your second question (2) it is not calling your memset it is optimizing your code because it sees it as an inefficient memset. Fuz has provided the answers to your question.
Rename the function
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: e92d4010 push {r4, lr}
8: e1a04000 mov r4, r0
c: 0a000001 beq 18 <xmemset+0x18>
10: e6ef1071 uxtb r1, r1
14: ebfffffe bl 0 <memset>
18: e1a00004 mov r0, r4
1c: e8bd8010 pop {r4, pc}
and you can see this.
If you were to use -ffreestanding as Fuz recommended then you see this or something like it
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: 012fff1e bxeq lr
8: e92d41f0 push {r4, r5, r6, r7, r8, lr}
c: e2426001 sub r6, r2, #1
10: e3560002 cmp r6, #2
14: e6efe071 uxtb lr, r1
18: 9a00002a bls c8 <xmemset+0xc8>
1c: e3a0c000 mov r12, #0
20: e3520023 cmp r2, #35 ; 0x23
24: e7c7c01e bfi r12, lr, #0, #8
28: e1a04122 lsr r4, r2, #2
2c: e7cfc41e bfi r12, lr, #8, #8
30: e7d7c81e bfi r12, lr, #16, #8
34: e7dfcc1e bfi r12, lr, #24, #8
38: 9a000024 bls d0 <xmemset+0xd0>
3c: e2445009 sub r5, r4, #9
40: e1a03000 mov r3, r0
44: e3c55007 bic r5, r5, #7
48: e3a07000 mov r7, #0
4c: e2851008 add r1, r5, #8
50: e1570005 cmp r7, r5
54: f5d3f0a0 pld [r3, #160] ; 0xa0
58: e1a08007 mov r8, r7
5c: e583c000 str r12, [r3]
60: e583c004 str r12, [r3, #4]
64: e2877008 add r7, r7, #8
68: e583c008 str r12, [r3, #8]
6c: e2833020 add r3, r3, #32
70: e503c014 str r12, [r3, #-20] ; 0xffffffec
74: e503c010 str r12, [r3, #-16]
78: e503c00c str r12, [r3, #-12]
7c: e503c008 str r12, [r3, #-8]
80: e503c004 str r12, [r3, #-4]
84: 1afffff1 bne 50 <xmemset+0x50>
88: e2811001 add r1, r1, #1
8c: e483c004 str r12, [r3], #4
90: e1540001 cmp r4, r1
94: 8afffffb bhi 88 <xmemset+0x88>
98: e3c23003 bic r3, r2, #3
9c: e1520003 cmp r2, r3
a0: e0466003 sub r6, r6, r3
a4: e0803003 add r3, r0, r3
a8: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
ac: e3560000 cmp r6, #0
b0: e5c3e000 strb lr, [r3]
b4: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
b8: e3560001 cmp r6, #1
bc: e5c3e001 strb lr, [r3, #1]
c0: 15c3e002 strbne lr, [r3, #2]
c4: e8bd81f0 pop {r4, r5, r6, r7, r8, pc}
c8: e1a03000 mov r3, r0
cc: eafffff6 b ac <xmemset+0xac>
d0: e1a03000 mov r3, r0
d4: e3a01000 mov r1, #0
d8: eaffffea b 88 <xmemset+0x88>
which appears like it simply inlined memset, the one it knows not your code (the faster one).
So if you want it to use your code then stick with -O2. Yours is pretty inefficient so not sure why you need to push it any further than it was.
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
It isn't going to get any better than that without replacing your code with something else.
Fuz already answered the question:
Compile with -fno-builtin-memset. The compiler recognises that the function implements memset and thus replaces it with a call to memset. You should in general compile with -ffreestanding when writing bare-metal code. I believe this fixes this sort of problem, too
It is replacing your code with memset, if you want it not to do that use -ffreestanding.
If you wish to go beyond that and wonder why -fno-builtin-memset didn't work that is a question for the gcc folks, file a ticket, let us know what they say (or just look at the compiler source code).

How does runtime memory allocation for local variable's work in embedded systems?

We are using ccrx compiler, embOS RTOS
There is a function in the code,
void fun( )
{
if(condition)
{ int a;}
else if(condition1)
{int b;}............
else
{ int z;}
}
Whenever the function gets called, related thread stack overflows. If few of the int variable declarations are commented, the thread stack does not get overflowed.
How is the stack allocated? Doesn't the memory get allocated after a condition becomes successful?
Lets take GCC for example
void fun( int condition, int condition1 )
{
if(condition)
{ int a; a=5;}
else if(condition1)
{int b; b=7;}
else
{ int z; z=9; }
}
and pick a target, not going to pay for or whatever to get ccrx...
00000000 <fun>:
0: e52db004 push {r11} ; (str r11, [sp, #-4]!)
4: e28db000 add r11, sp, #0
8: e24dd01c sub sp, sp, #28
c: e50b0018 str r0, [r11, #-24] ; 0xffffffe8
10: e50b101c str r1, [r11, #-28] ; 0xffffffe4
14: e51b3018 ldr r3, [r11, #-24] ; 0xffffffe8
18: e3530000 cmp r3, #0
1c: 0a000002 beq 2c <fun+0x2c>
20: e3a03005 mov r3, #5
24: e50b3008 str r3, [r11, #-8]
28: ea000007 b 4c <fun+0x4c>
2c: e51b301c ldr r3, [r11, #-28] ; 0xffffffe4
30: e3530000 cmp r3, #0
34: 0a000002 beq 44 <fun+0x44>
38: e3a03007 mov r3, #7
3c: e50b300c str r3, [r11, #-12]
40: ea000001 b 4c <fun+0x4c>
44: e3a03009 mov r3, #9
48: e50b3010 str r3, [r11, #-16]
4c: e1a00000 nop ; (mov r0, r0)
50: e28bd000 add sp, r11, #0
54: e49db004 pop {r11} ; (ldr r11, [sp], #4)
58: e12fff1e bx lr
without the allocations
void fun( int condition, int condition1 )
{
if(condition)
{ int a;/* a=5;*/}
else if(condition1)
{int b;/* b=7;*/}
else
{ int z; /*z=9;*/ }
}
even with no optimization these variables are dead code and optimized out
00000000 <fun>:
0: e52db004 push {r11} ; (str r11, [sp, #-4]!)
4: e28db000 add r11, sp, #0
8: e24dd00c sub sp, sp, #12
c: e50b0008 str r0, [r11, #-8]
10: e50b100c str r1, [r11, #-12]
14: e1a00000 nop ; (mov r0, r0)
18: e28bd000 add sp, r11, #0
1c: e49db004 pop {r11} ; (ldr r11, [sp], #4)
20: e12fff1e bx lr
There are some bytes on the stack for alignment, they could have shaved some off and stayed aligned but that is another topic.
The point here is that just because in the high level language your variables are only used in a portion of the function doesnt mean that the compiler has to do it that way, compilers certainly gcc, tend to do all of their stack allocation at the beginning of the function and cleanup at the end. As was done here...
This is not unlike
int fun( void )
{
static int x;
x++;
if(x>10) return(1);
if(fun()) return(1);
return(0);
}
which gives
00000000 <fun>:
0: e59f2030 ldr r2, [pc, #48] ; 38 <fun+0x38>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e353000a cmp r3, #10
10: e5823000 str r3, [r2]
14: da000001 ble 20 <fun+0x20>
18: e3a00001 mov r0, #1
1c: e12fff1e bx lr
20: e92d4010 push {r4, lr}
24: ebfffffe bl 0 <fun>
28: e2900000 adds r0, r0, #0
2c: 13a00001 movne r0, #1
30: e8bd4010 pop {r4, lr}
34: e12fff1e bx lr
38: 00000000 andeq r0, r0, r0
Disassembly of section .bss:
00000000 <x.4089>:
0: 00000000 andeq r0, r0, r0
it is a local variable but by being static goes into the global pool, not allocated on the stack like other local variables (or optimized into registers).
Although interesting that this was a case where it didnt allocate on the stack right away, although that is good in this case, dont want to burden the stack with recursion if you dont have to. Nice optimization.
No reason to assume that the stack pointer will change multiple times throughout the function because of what you did in the high level language. My guess is it makes developing the compiler easier to do it all in one shot up front even though that is wasteful on memory. On the other hand as you go would cost more instructions (space and time).

Why does direct accessing to structure members produces significantly more assembly code compared to indirect accessing in GCC?

There are many topics here whether a direct access or an indirect access (via pointer) are faster when accessing structure members, in C.
One example: C pointers vs direct member access for structs
The general opinion is direct access will be faster (at least theoretically) since pointer dereferencing is not used.
So I gave it a try with a chunk of code in my system: GNU Embedded Tools GCC 4.7.4, generating code for ARM (actually ARM-Cortex-A15).
Surprisingly, direct access was much slower. Then I generated the assembly codes for the object file.
Direct access code has 114 lines of assembly code, and indirect access code has 33 lines of assembly code. What is going on here?
Below are the C code and generated assembly code of the functions. The structures are all map to external memory and the structure members are all one-byte word long (unsigned char type).
First function, with indirect access:
void sub_func_1(unsigned int num_of, struct file_s *__restrict__ first_file_ptr, struct file_s *__restrict__ second_file_ptr, struct output_s *__restrict__ output_ptr)
{
if(LIKELY(num_of == 0))
{
output_ptr->curr_id = UNUSED;
output_ptr->curr_cnt = output_ptr->cnt;
output_ptr->curr_mode = output_ptr->_mode;
output_ptr->curr_type = output_ptr->type;
output_ptr->curr_size = output_ptr->size;
output_ptr->curr_allocation_type = output_ptr->allocation_type;
output_ptr->curr_allocation_localized = output_ptr->allocation_localized;
output_ptr->curr_mode_enable = output_ptr->mode_enable;
if(output_ptr->curr_cnt == 1)
{
first_file_ptr->status = BLOCK_IDLE;
first_file_ptr->type = USER_DATA_TYPE;
first_file_ptr->index = FIRST__WORD;
first_file_ptr->layer_cnt = output_ptr->layer_cnt;
second_file_ptr->status = DISABLED;
second_file_ptr->index = 0;
second_file_ptr->redundancy_version = 1;
output_ptr->total_layer_cnt = first_file_ptr->layer_cnt;
}
}
}
00000000 <sub_func_1>:
0: e3500000 cmp r0, #0
4: e92d01f0 push {r4, r5, r6, r7, r8}
8: 1a00001b bne 7c <sub_func_1+0x7c>
c: e5d34007 ldrb r4, [r3, #7]
10: e3a05008 mov r5, #8
14: e5d3c003 ldrb ip, [r3, #3]
18: e5d38014 ldrb r8, [r3, #20]
1c: e5c35001 strb r5, [r3, #1]
20: e5d37015 ldrb r7, [r3, #21]
24: e5d36018 ldrb r6, [r3, #24]
28: e5c34008 strb r4, [r3, #8]
2c: e5d35019 ldrb r5, [r3, #25]
30: e35c0001 cmp ip, #1
34: e5c3c005 strb ip, [r3, #5]
38: e5d34012 ldrb r4, [r3, #18]
3c: e5c38010 strb r8, [r3, #16]
40: e5c37011 strb r7, [r3, #17]
44: e5c3601a strb r6, [r3, #26]
48: e5c3501b strb r5, [r3, #27]
4c: e5c34013 strb r4, [r3, #19]
50: 1a000009 bne 7c <sub_func_1+0x7c>
54: e5d3400b ldrb r4, [r3, #11]
58: e3a05005 mov r5, #5
5c: e5c1c000 strb ip, [r1]
60: e5c10002 strb r0, [r1, #2]
64: e5c15001 strb r5, [r1, #1]
68: e5c20000 strb r0, [r2]
6c: e5c14003 strb r4, [r1, #3]
70: e5c20005 strb r0, [r2, #5]
74: e5c2c014 strb ip, [r2, #20]
78: e5c3400f strb r4, [r3, #15]
7c: e8bd01f0 pop {r4, r5, r6, r7, r8}
80: e12fff1e bx lr
Second function, with direct access:
void sub_func_2(unsigned int output_index, unsigned int cc_index, unsigned int num_of)
{
if(LIKELY(num_of == 0))
{
output_file[output_index].curr_id = UNUSED;
output_file[output_index].curr_cnt = output_file[output_index].cnt;
output_file[output_index].curr_mode = output_file[output_index]._mode;
output_file[output_index].curr_type = output_file[output_index].type;
output_file[output_index].curr_size = output_file[output_index].size;
output_file[output_index].curr_allocation_type = output_file[output_index].allocation_type;
output_file[output_index].curr_allocation_localized = output_file[output_index].allocation_localized;
output_file[output_index].curr_mode_enable = output_file[output_index].mode_enable;
if(output_file[output_index].curr_cnt == 1)
{
output_file[output_index].cc_file[cc_index].file[0].status = BLOCK_IDLE;
output_file[output_index].cc_file[cc_index].file[0].type = USER_DATA_TYPE;
output_file[output_index].cc_file[cc_index].file[0].index = FIRST__WORD;
output_file[output_index].cc_file[cc_index].file[0].layer_cnt = output_file[output_index].layer_cnt;
output_file[output_index].cc_file[cc_index].file[1].status = DISABLED;
output_file[output_index].cc_file[cc_index].file[1].index = 0;
output_file[output_index].cc_file[cc_index].file[1].redundancy_version = 1;
output_file[output_index].total_layer_cnt = output_file[output_index].cc_file[cc_index].file[0].layer_cnt;
}
}
}
00000084 <sub_func_2>:
84: e92d0ff0 push {r4, r5, r6, r7, r8, r9, sl, fp}
88: e3520000 cmp r2, #0
8c: e24dd018 sub sp, sp, #24
90: e58d2004 str r2, [sp, #4]
94: 1a000069 bne 240 <sub_func_2+0x1bc>
98: e3a03d61 mov r3, #6208 ; 0x1840
9c: e30dc0c0 movw ip, #53440 ; 0xd0c0
a0: e340c001 movt ip, #1
a4: e3002000 movw r2, #0
a8: e0010193 mul r1, r3, r1
ac: e3402000 movt r2, #0
b0: e3067490 movw r7, #25744 ; 0x6490
b4: e3068488 movw r8, #25736 ; 0x6488
b8: e3a0b008 mov fp, #8
bc: e3066498 movw r6, #25752 ; 0x6498
c0: e02c109c mla ip, ip, r0, r1
c4: e082c00c add ip, r2, ip
c8: e28c3b19 add r3, ip, #25600 ; 0x6400
cc: e08c4007 add r4, ip, r7
d0: e5d39083 ldrb r9, [r3, #131] ; 0x83
d4: e08c5006 add r5, ip, r6
d8: e5d3a087 ldrb sl, [r3, #135] ; 0x87
dc: e5c3b081 strb fp, [r3, #129] ; 0x81
e0: e5c39085 strb r9, [r3, #133] ; 0x85
e4: e2833080 add r3, r3, #128 ; 0x80
e8: e7cca008 strb sl, [ip, r8]
ec: e5d4a004 ldrb sl, [r4, #4]
f0: e7cca007 strb sl, [ip, r7]
f4: e5d47005 ldrb r7, [r4, #5]
f8: e5c47001 strb r7, [r4, #1]
fc: e7dc6006 ldrb r6, [ip, r6]
100: e5d5c001 ldrb ip, [r5, #1]
104: e5c56002 strb r6, [r5, #2]
108: e5c5c003 strb ip, [r5, #3]
10c: e5d4c002 ldrb ip, [r4, #2]
110: e5c4c003 strb ip, [r4, #3]
114: e5d33005 ldrb r3, [r3, #5]
118: e3530001 cmp r3, #1
11c: 1a000047 bne 240 <sub_func_2+0x1bc>
120: e30dc0c0 movw ip, #53440 ; 0xd0c0
124: e30db0c0 movw fp, #53440 ; 0xd0c0
128: e1a0700c mov r7, ip
12c: e7dfc813 bfi ip, r3, #16, #16
130: e1a05007 mov r5, r7
134: e1a0900b mov r9, fp
138: e02c109c mla ip, ip, r0, r1
13c: e1a04005 mov r4, r5
140: e1a0a00b mov sl, fp
144: e7df9813 bfi r9, r3, #16, #16
148: e7dfb813 bfi fp, r3, #16, #16
14c: e1a06007 mov r6, r7
150: e7dfa813 bfi sl, r3, #16, #16
154: e58dc008 str ip, [sp, #8]
158: e7df6813 bfi r6, r3, #16, #16
15c: e1a0c004 mov ip, r4
160: e7df4813 bfi r4, r3, #16, #16
164: e02b109b mla fp, fp, r0, r1
168: e7df5813 bfi r5, r3, #16, #16
16c: e0291099 mla r9, r9, r0, r1
170: e7df7813 bfi r7, r3, #16, #16
174: e7dfc813 bfi ip, r3, #16, #16
178: e0261096 mla r6, r6, r0, r1
17c: e0241094 mla r4, r4, r0, r1
180: e082b00b add fp, r2, fp
184: e0829009 add r9, r2, r9
188: e02a109a mla sl, sl, r0, r1
18c: e28bbc65 add fp, fp, #25856 ; 0x6500
190: e58d600c str r6, [sp, #12]
194: e2899c65 add r9, r9, #25856 ; 0x6500
198: e3a06005 mov r6, #5
19c: e58d4010 str r4, [sp, #16]
1a0: e59d4008 ldr r4, [sp, #8]
1a4: e0251095 mla r5, r5, r0, r1
1a8: e5cb3000 strb r3, [fp]
1ac: e082a00a add sl, r2, sl
1b0: e59db00c ldr fp, [sp, #12]
1b4: e5c96001 strb r6, [r9, #1]
1b8: e59d6004 ldr r6, [sp, #4]
1bc: e28aac65 add sl, sl, #25856 ; 0x6500
1c0: e58d5014 str r5, [sp, #20]
1c4: e0271097 mla r7, r7, r0, r1
1c8: e0825004 add r5, r2, r4
1cc: e30d40c0 movw r4, #53440 ; 0xd0c0
1d0: e02c109c mla ip, ip, r0, r1
1d4: e0855008 add r5, r5, r8
1d8: e7df4813 bfi r4, r3, #16, #16
1dc: e5ca6002 strb r6, [sl, #2]
1e0: e5d59003 ldrb r9, [r5, #3]
1e4: e082600b add r6, r2, fp
1e8: e59db014 ldr fp, [sp, #20]
1ec: e0201094 mla r0, r4, r0, r1
1f0: e2866c65 add r6, r6, #25856 ; 0x6500
1f4: e59d1010 ldr r1, [sp, #16]
1f8: e306a53c movw sl, #25916 ; 0x653c
1fc: e0827007 add r7, r2, r7
200: e2877c65 add r7, r7, #25856 ; 0x6500
204: e082c00c add ip, r2, ip
208: e5c69003 strb r9, [r6, #3]
20c: e59d6004 ldr r6, [sp, #4]
210: e28ccc65 add ip, ip, #25856 ; 0x6500
214: e082500b add r5, r2, fp
218: e0820000 add r0, r2, r0
21c: e0824001 add r4, r2, r1
220: e085500a add r5, r5, sl
224: e0808008 add r8, r0, r8
228: e7c4600a strb r6, [r4, sl]
22c: e5c56005 strb r6, [r5, #5]
230: e5c73050 strb r3, [r7, #80] ; 0x50
234: e5dc3003 ldrb r3, [ip, #3]
238: e287704c add r7, r7, #76 ; 0x4c
23c: e5c83007 strb r3, [r8, #7]
240: e28dd018 add sp, sp, #24
244: e8bd0ff0 pop {r4, r5, r6, r7, r8, r9, sl, fp}
248: e12fff1e bx lr
And last part, my compile options are:
# Compile options.
C_OPTS = -Wall \
-std=gnu99 \
-fgnu89-inline \
-Wcast-align \
-Werror=uninitialized \
-Werror=maybe-uninitialized \
-Werror=overflow \
-mcpu=cortex-a15 \
-mtune=cortex-a15 \
-mabi=aapcs \
-mfpu=neon \
-ftree-vectorize \
-ftree-slp-vectorize \
-ftree-vectorizer-verbose=4 \
-mfloat-abi=hard \
-O3 \
-flto \
-marm \
-ffat-lto-objects \
-fno-gcse \
-fno-strict-aliasing \
-fno-delete-null-pointer-checks \
-fno-strict-overflow \
-fuse-linker-plugin \
-falign-functions=4 \
-falign-loops=4 \
-falign-labels=4 \
-falign-jumps=4
Update:
Note: I deleted the structure definitions because there was differences with the version of my own program. It is actually a huge structure and it is not efficient to put here completely.
As suggested, I get rid of -fno-gcse, and the generated asm is not huge as before.
Without -fno-gcse, sub_func_1 generates the same code as above.
For sub_func_2:
00000084 <sub_func_2>:
84: e3520000 cmp r2, #0
88: e92d0070 push {r4, r5, r6}
8c: 1a000030 bne 154 <sub_func_2+0xd0>
90: e30d30c0 movw r3, #53440 ; 0xd0c0
94: e3a06008 mov r6, #8
98: e3403001 movt r3, #1
9c: e0030093 mul r3, r3, r0
a0: e3a00d61 mov r0, #6208 ; 0x1840
a4: e0213190 mla r1, r0, r1, r3
a8: e59f30ac ldr r3, [pc, #172] ; 15c <sub_func_2+0xd8>
ac: e0831001 add r1, r3, r1
b0: e2813b19 add r3, r1, #25600 ; 0x6400
b4: e5d34083 ldrb r4, [r3, #131] ; 0x83
b8: e1a00003 mov r0, r3
bc: e5d35087 ldrb r5, [r3, #135] ; 0x87
c0: e5c36081 strb r6, [r3, #129] ; 0x81
c4: e5c34085 strb r4, [r3, #133] ; 0x85
c8: e3064488 movw r4, #25736 ; 0x6488
cc: e2833080 add r3, r3, #128 ; 0x80
d0: e7c15004 strb r5, [r1, r4]
d4: e5d05094 ldrb r5, [r0, #148] ; 0x94
d8: e0844006 add r4, r4, r6
dc: e7c15004 strb r5, [r1, r4]
e0: e5d04095 ldrb r4, [r0, #149] ; 0x95
e4: e5d0c092 ldrb ip, [r0, #146] ; 0x92
e8: e5c04091 strb r4, [r0, #145] ; 0x91
ec: e3064498 movw r4, #25752 ; 0x6498
f0: e7d15004 ldrb r5, [r1, r4]
f4: e5c0c093 strb ip, [r0, #147] ; 0x93
f8: e5d04099 ldrb r4, [r0, #153] ; 0x99
fc: e5c0509a strb r5, [r0, #154] ; 0x9a
100: e5c0409b strb r4, [r0, #155] ; 0x9b
104: e5d33005 ldrb r3, [r3, #5]
108: e3530001 cmp r3, #1
10c: 1a000010 bne 154 <sub_func_2+0xd0>
110: e281cc65 add ip, r1, #25856 ; 0x6500
114: e3a06005 mov r6, #5
118: e2810b19 add r0, r1, #25600 ; 0x6400
11c: e1a0500c mov r5, ip
120: e5cc3000 strb r3, [ip]
124: e1a0400c mov r4, ip
128: e5cc6001 strb r6, [ip, #1]
12c: e5cc2002 strb r2, [ip, #2]
130: e5d0608b ldrb r6, [r0, #139] ; 0x8b
134: e5cc6003 strb r6, [ip, #3]
138: e306c53c movw ip, #25916 ; 0x653c
13c: e7c1200c strb r2, [r1, ip]
140: e5c52041 strb r2, [r5, #65] ; 0x41
144: e285503c add r5, r5, #60 ; 0x3c
148: e5c43050 strb r3, [r4, #80] ; 0x50
14c: e284404c add r4, r4, #76 ; 0x4c
150: e5c0608f strb r6, [r0, #143] ; 0x8f
154: e8bd0070 pop {r4, r5, r6}
158: e12fff1e bx lr
15c: 00000000 .word 0x00000000
TL:DR: can't reproduce that insane compiler output. Maybe the surrounding code + LTO did it?
I do have suggestions to improve the code: see the stuff below about copying whole structs instead of copying many individual members.
The question you linked is about accessing a value-type global directly vs. through a global pointer. On ARM, where it takes multiple instructions or a load from a nearby constant to get an arbitrary 32bit pointer into a register, passing around pointers is better than having each function reference a global directly.
See this example on the Godbolt Compiler Explorer (ARM gcc 4.8.2 -O3)
struct example {
int a, b, c;
} global_example;
int load_global(void) { return global_example.c; }
movw r3, #:lower16:global_example # tmp113,
movt r3, #:upper16:global_example # tmp113,
ldr r0, [r3, #8] #, global_example.c
bx lr #
int load_pointer(struct example *p) { return p->c; }
ldr r0, [r0, #8] #, p_2(D)->c
bx lr #
(Apparently gcc is horrible at passing structs by val as function args, see the code for byval(struct example by_val) on the godbolt link.)
Even worse is if you have a global pointer: first you have to load the value of the pointer, then another load to dereference it. This is the indirection overhead that was being discussed in the question you linked. If both loads miss in cache, you're paying the round-trip latency twice. The load address for the 2nd load isn't available until the first load completes, so no pipelining of those memory requests is possible even on an out-of-order CPU.
If you already have a pointer as an arg, it will be in a register. Dereferencing it is the same as loading from a global. (But better, because you don't need to get the global's address into a register yourself.)
Your real code
I can't reproduce your massive asm output with ARM gcc 4.8.2 on Godbolt, or locally with ARM gcc 5.2.1. I'm not using LTO, though, since I don't have a complete test program.
All I can see is just slightly larger code to do some index math.
bfi is Bitfield Insert. I think 144: e7df9813 bfi r9, r3, #16, #16 is setting the top half of r9 = low half of r3. I don't see how that and mla (integer mul-accumulate) make much sense. Other than perverse results from -ftree-vectorize, all I can think of is maybe -fno-gcse has a really bad impact for the version of gcc you tested.
Is it manipulating constants that are going to be stored? The code you actually posted #defines everything to 0, which gcc takes advantage of. (It also takes advantage of the fact that it already has 1 in a register if curr_cnt == 1, and stores that register for the second_file_ptr->redundancy_version = 1;). ARM doesn't have a str [mem], immediate or anything like x86's mov [mem], imm.
If your compiler output is from code with different values for those constants, the compiler would be doing more work to store different things.
Unfortunately gcc is bad at merging narrow stores into a single wider store (long-standing missed-optimization bug). For x86, clang does this in at least one case, storing 0x0100 (256) instead of a 0 and a 1. (check on godbolt by flipping the compiler to clang 3.7.1 or something, and removing the ARM-specific compiler args. There's a mov word ptr \[rsi\], 256 where gcc uses
mov BYTE PTR [rsi], 0 # *first_file_ptr_23(D).status,
mov BYTE PTR [rsi+1], 1 # *first_file_ptr_23(D).type,
If you arranged your structs carefully, there would be more opportunities for copying 4B blocks in this function.
It might also help to have two identical sub-structs of curr and not-curr, instead of curr_size and size. You might have to declare it packed to avoid padding after the sub-structs, though. Your two groups of members aren't in exactly the same order, which prevents compilers from doing much block-copying anyway when you do a bunch of assignments.
It helps gcc and clang copy multiple bytes at once if you do:
struct output_s_optimized {
struct __attribute__((packed)) stuff {
unsigned char cnt,
mode,
type,
size,
allocation_type,
allocation_localized,
mode_enable;
} curr; // 7B
unsigned char curr_id; // no non-curr id?
struct stuff non_curr;
unsigned char layer_cnt;
// Another 8 byte boundary here
unsigned char total_layer_cnt;
struct cc_file_s cc_file[128];
};
void foo(struct output_s_optimized *p) {
p->curr_id = 0;
p->non_curr = p->curr;
}
void bar(struct output_s_optimized *output_ptr) {
output_ptr->curr_id = 0;
output_ptr->curr.cnt = output_ptr->non_curr.cnt;
output_ptr->curr.mode = output_ptr->non_curr.mode;
output_ptr->curr.type = output_ptr->non_curr.type;
output_ptr->curr.size = output_ptr->non_curr.size;
output_ptr->curr.allocation_type = output_ptr->non_curr.allocation_type;
output_ptr->curr.allocation_localized = output_ptr->non_curr.allocation_localized;
output_ptr->curr.mode_enable = output_ptr->non_curr.mode_enable;
}
gcc 4.8.2 compiles foo() to three copies: byte, 2B, and 4B, even on ARM. It compiles bar() to eight 1B copies, and so does clang-3.8 on x86. So copying whole structs can help your compiler a lot (as well as making sure the data to be copied is arranged in the same order in both locations).
the same code on x86: nothing new
You can use -fverbose-asm to put comments on each line. For x86, the compiler output from gcc 6.1 -O3 is very similar between versions, as you can see on the Godbolt Compiler Explorer. x86 addressing modes can index a global variable directly, so you see stuff like
movzx edi, BYTE PTR [rcx+10] # *output_ptr_7(D)._mode
# where rcx is the output_ptr arg, used directly
vs.
movzx ecx, BYTE PTR output_file[rdi+10] # output_file[output_index_7(D)]._mode
# where rdi = output_index * 1297 (sizeof(output_file[0])), calculated once at the start
(gcc apparently doesn't care that each instruction has a 4B displacement as part of the addressing mode, but this is an ARM question so I won't go tradeoffs between code-size and insn count with x86's variable-length insns.)
In broad (architecture-agnostic) terms, this is what your instructions do:
global_structure_pointer->field = value;
loads the value of global_structure_pointer into an addressing register.
adds the offset represented by field to the addressing register.
stores value into the memory location addressed by the addressing register.
global_structure[index].field = value;
loads the address of global_structure into an addressing register.
loads the value of index into an arithmetic register.
multiplies the arithmetic register by the size of global_structure.
adds the arithmetic register to the addressing register.
stores value into the memory location addressed by the addressing register.
Your confusion seems to be due to a misunderstanding of what "direct access" actually is.
THIS is direct access:
global_structure.field = value;
What you thought of as direct access is in fact indexed access.

Resources