Example of a static vs automatic variable in assembly - c

In C, we can use the following two examples to show the difference between a static and non-static variable:
for (int i = 0; i < 5; ++i) {
static int n = 0;
printf("%d ", ++n); // prints 1 2 3 4 5 - the value persists
}
And:
for (int i = 0; i < 5; ++i) {
int n = 0;
printf("%d ", ++n); // prints 1 1 1 1 1 - the previous value is lost
}
What would be the most basic example in assembly to show the difference between how a static or non-static variable is created? (Or does this concept not exist in assembly?)

To implement a static object in assembly, you define it in a data section (of which there are various types, involving options for initialization and modification).
To implement an automatic object in assembly, you include space for it in the stack frame of a routine.
Examples, not necessarily syntactically correct in a particular assembly language, might be:
.data
foo: .word 34 // Static object named "foo".
.text
…
lr r3, foo // Load value of foo.
and:
.text
bar: // Start of routine named "bar".
foo = 16 // Define a symbol for convenience.
add sp, sp, -CalculatedSize // Allocate stack space for local data.
…
li r3, 34 // Load immediate value into register.
sr r3, foo(sp) // Store value into space reserved for foo on stack.
…
add sp, sp, +CalculatedSize // Automatic objects are released here.
ret
These are very simplified examples (as requested). Many modern schemes for using the hardware stack include frame pointers, which are not included above.
In the second example, CalculatedSize represents some amount that includes space for registers to be saved, space for the foo object, space for arguments for subroutine calls, and whatever other stack space is needed by the routine. The offset of 16 provided for foo is part of those calculations; the author of the routine would arrange their stack frame largely as they desire.

Just try it
void more_fun ( int );
void fun0 ( void )
{
for (int i = 0; i < 500; ++i) {
static int n = 0;
more_fun(++n);
}
}
void fun1 ( void )
{
for (int i = 0; i < 500; ++i) {
int n = 0;
more_fun( ++n);
}
}
Disassembly of section .text:
00000000 <fun0>:
0: e92d4070 push {r4, r5, r6, lr}
4: e3a04f7d mov r4, #500 ; 0x1f4
8: e59f501c ldr r5, [pc, #28] ; 2c <fun0+0x2c>
c: e5953000 ldr r3, [r5]
10: e2833001 add r3, r3, #1
14: e1a00003 mov r0, r3
18: e5853000 str r3, [r5]
1c: ebfffffe bl 0 <more_fun>
20: e2544001 subs r4, r4, #1
24: 1afffff8 bne c <fun0+0xc>
28: e8bd8070 pop {r4, r5, r6, pc}
2c: 00000000
00000030 <fun1>:
30: e92d4010 push {r4, lr}
34: e3a04f7d mov r4, #500 ; 0x1f4
38: e3a00001 mov r0, #1
3c: ebfffffe bl 0 <more_fun>
40: e2544001 subs r4, r4, #1
44: 1afffffb bne 38 <fun1+0x8>
48: e8bd8010 pop {r4, pc}
Disassembly of section .bss:
00000000 <n.4158>:
0: 00000000 andeq r0, r0, r0
I like to think of static locals as local globals. They sit in .bss or .data just like globals, but from a C perspective they can only be accessed within the function/scope in which they were created.
A local variable has no need for long-term storage, so it is "created" and destroyed within that function. If we were to not optimize, you would see that some stack space is allocated.
00000064 <fun1>:
64: e92d4800 push {fp, lr}
68: e28db004 add fp, sp, #4
6c: e24dd008 sub sp, sp, #8
70: e3a03000 mov r3, #0
74: e50b300c str r3, [fp, #-12]
78: ea000009 b a4 <fun1+0x40>
7c: e3a03000 mov r3, #0
80: e50b3008 str r3, [fp, #-8]
84: e51b3008 ldr r3, [fp, #-8]
88: e2833001 add r3, r3, #1
8c: e50b3008 str r3, [fp, #-8]
90: e51b0008 ldr r0, [fp, #-8]
94: ebfffffe bl 0 <more_fun>
98: e51b300c ldr r3, [fp, #-12]
9c: e2833001 add r3, r3, #1
a0: e50b300c str r3, [fp, #-12]
a4: e51b300c ldr r3, [fp, #-12]
a8: e3530f7d cmp r3, #500 ; 0x1f4
ac: bafffff2 blt 7c <fun1+0x18>
b0: e1a00000 nop ; (mov r0, r0)
b4: e24bd004 sub sp, fp, #4
b8: e8bd8800 pop {fp, pc}
But optimized, for fun1 the local variable is kept in a register, which is faster than keeping it on the stack. In this solution the compiler saves the upstream value held in r4 so that r4 can be used to hold n within this function; when the function returns there is no more need for n, per the rules of the language.
For the static local, per the rules of the language the value persists outside the function and can be accessed within it. Because it is initialized to 0 it lives in .bss, not .data (with gcc, and many others). In the code above the linker will fill this value
2c: 00000000
in with the address to this
00000000 <n.4158>:
0: 00000000 andeq r0, r0, r0
IMO one could argue the implementation didn't need to treat it like a volatile and sample and save n every loop iteration. It could have basically implemented it like the second function, but sampled n from memory up front and saved it back at the end. Either way you can see the difference in an implementation of the high-level code: the non-static local only lives within the function, and then its storage and contents are essentially gone.

A local variable modified by static is initialized only once, and its lifetime is extended to the entire run of the program.
If you don't add static, each pass through the loop re-creates and re-initializes the variable.

Related

How do I know if the compiler will optimize a variable?

I am new to microcontrollers. I have read a lot of articles and documentation about volatile variables in C. What I understood is that by using volatile we are telling the compiler not to cache or optimize the variable. However, I still didn't get when this should really be used. For example, let's say I have a simple counter and a for loop like this:
for(int i=0; i < blabla.length; i++) {
//code here
}
or maybe when I write a simple piece of code like this:
int i=1;
int j=1;
printf("the sum is: %d\n", i+j);
I have never cared about compiler optimization for such examples. But in many cases, if the variable is not declared as volatile, the output won't be as expected. How would I know that I have to care about compiler optimization in other examples?
Simple example:
int flag = 1;
while (flag)
{
do something that doesn't involve flag
}
This can be optimized to:
while (true)
{
do something
}
because the compiler knows that flag never changes.
with this code:
volatile int flag = 1;
while (flag)
{
do something that doesn't involve flag
}
nothing will be optimized, because now the compiler knows: "although the program doesn't change flag inside the while loop, it might change anyway".
According to cppreference:
volatile object - an object whose type is volatile-qualified, or a subobject of a volatile object, or a mutable subobject of a const-volatile object. Every access (read or write operation, member function call, etc.) made through a glvalue expression of volatile-qualified type is treated as a visible side-effect for the purposes of optimization (that is, within a single thread of execution, volatile accesses cannot be optimized out or reordered with another visible side effect that is sequenced-before or sequenced-after the volatile access. This makes volatile objects suitable for communication with a signal handler, but not with another thread of execution, see std::memory_order). Any attempt to refer to a volatile object through a non-volatile glvalue (e.g. through a reference or pointer to non-volatile type) results in undefined behavior.
This explains why some optimizations can't be made by the compiler, since it can't entirely predict at compile time when the object's value will be modified. This qualifier tells the compiler that it shouldn't perform these optimizations, because the value can change in a way unknown to the compiler.
I have not worked recently with microcontrollers, but I think that the states of different electrical input and output pins have to be marked as volatile, since the compiler doesn't know that they can be changed externally (in this case by means other than code, such as when you plug in a component).
Just try it. First off there is the language and what is permitted to be optimized, and then there is what the compiler actually figures out and optimizes. That something can be optimized does not mean the compiler will figure it out, nor will it always produce the code you expect.
Volatile has nothing to do with caching of any kind (did we not just get this question recently using that term?). Volatile indicates to the compiler that the variable should not be optimized into a register or optimized away; let us say all accesses to that variable must go back to memory. Different compilers have a different understanding of how to use volatile, though. I have seen clang (llvm) and gcc (gnu) disagree: when the variable was used twice in a row or something like that, clang didn't do two reads, it only did one.
It was a Stack Overflow question, and you are welcome to search for it; the clang code was slightly faster than gcc, simply because of one less instruction, due to differences of opinion on how to implement volatile. So even there the main compiler folks can't agree on what it really means. It's the nature of the C language: lots of implementation-defined features, and pro tip, avoid them (volatile, bitfields, unions, etc.), certainly across compile domains.
void fun0 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
}
}
00000000 <fun0>:
0: 4770 bx lr
This is completely dead code; it does nothing and touches nothing, all the items are local, so it can all go away and the function simply returns.
unsigned int fun1 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
}
return i;
}
00000004 <fun1>:
4: 2005 movs r0, #5
6: 4770 bx lr
This one returns something. The compiler can figure out it is counting, and the last value after the loop is what gets returned, so it just returns that value; no need for variables or any other code generation, the rest is dead code.
unsigned int fun2 ( unsigned int len )
{
unsigned int i;
for(i=0; i < len; i++)
{
}
return i;
}
00000008 <fun2>:
8: 4770 bx lr
Like fun1, except the value is passed in a register, which just happens to be the same register as the return value for the ABI of this target. So you do not even have to copy the length to the return value in this case. For other architectures or ABIs we would hope that this optimizes to return = len, a simple mov instruction, and that gets sent back.
unsigned int fun3 ( unsigned int len )
{
volatile unsigned int i;
for(i=0; i < len; i++)
{
}
return i;
}
0000000c <fun3>:
c: 2300 movs r3, #0
e: b082 sub sp, #8
10: 9301 str r3, [sp, #4]
12: 9b01 ldr r3, [sp, #4]
14: 4298 cmp r0, r3
16: d905 bls.n 24 <fun3+0x18>
18: 9b01 ldr r3, [sp, #4]
1a: 3301 adds r3, #1
1c: 9301 str r3, [sp, #4]
1e: 9b01 ldr r3, [sp, #4]
20: 4283 cmp r3, r0
22: d3f9 bcc.n 18 <fun3+0xc>
24: 9801 ldr r0, [sp, #4]
26: b002 add sp, #8
28: 4770 bx lr
2a: 46c0 nop ; (mov r8, r8)
It gets significantly different here; that is a lot of code compared to the ones thus far. We would like to think that volatile means all uses of that variable touch the memory for that variable.
12: 9b01 ldr r3, [sp, #4]
14: 4298 cmp r0, r3
16: d905 bls.n 24 <fun3+0x18>
Load i and compare it with len; if i is no longer less than len, we are done and exit the loop.
18: 9b01 ldr r3, [sp, #4]
1a: 3301 adds r3, #1
1c: 9301 str r3, [sp, #4]
i was less than len, so we need to increment it: read it, change it, write it back.
1e: 9b01 ldr r3, [sp, #4]
20: 4283 cmp r3, r0
22: d3f9 bcc.n 18 <fun3+0xc>
Do the i < len test again and either loop again or do not.
24: 9801 ldr r0, [sp, #4]
Get i from RAM so it can be returned.
All reads and writes of i involved the memory that holds i. Because we asked for that, the loop is no longer dead code; each iteration has to be implemented in order to perform all the touches of that variable's memory.
void fun4 ( void )
{
unsigned int a;
unsigned int b;
a = 1;
b = 1;
fun3(a+b);
}
0000002c <fun4>:
2c: 2300 movs r3, #0
2e: b082 sub sp, #8
30: 9301 str r3, [sp, #4]
32: 9b01 ldr r3, [sp, #4]
34: 2b01 cmp r3, #1
36: d805 bhi.n 44 <fun4+0x18>
38: 9b01 ldr r3, [sp, #4]
3a: 3301 adds r3, #1
3c: 9301 str r3, [sp, #4]
3e: 9b01 ldr r3, [sp, #4]
40: 2b01 cmp r3, #1
42: d9f9 bls.n 38 <fun4+0xc>
44: 9b01 ldr r3, [sp, #4]
46: b002 add sp, #8
48: 4770 bx lr
4a: 46c0 nop ; (mov r8, r8)
This both optimized out the addition and the a and b variables, and also optimized by inlining the fun3 function.
void fun5 ( void )
{
volatile unsigned int a;
unsigned int b;
a = 1;
b = 1;
fun3(a+b);
}
0000004c <fun5>:
4c: 2301 movs r3, #1
4e: b082 sub sp, #8
50: 9300 str r3, [sp, #0]
52: 2300 movs r3, #0
54: 9a00 ldr r2, [sp, #0]
56: 9301 str r3, [sp, #4]
58: 9b01 ldr r3, [sp, #4]
5a: 3201 adds r2, #1
5c: 429a cmp r2, r3
5e: d905 bls.n 6c <fun5+0x20>
60: 9b01 ldr r3, [sp, #4]
62: 3301 adds r3, #1
64: 9301 str r3, [sp, #4]
66: 9b01 ldr r3, [sp, #4]
68: 429a cmp r2, r3
6a: d8f9 bhi.n 60 <fun5+0x14>
6c: 9b01 ldr r3, [sp, #4]
6e: b002 add sp, #8
70: 4770 bx lr
fun3 is also inlined here, but the a variable is read from memory instead of being optimized out:
58: 9b01 ldr r3, [sp, #4]
5a: 3201 adds r2, #1
void fun6 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
fun3(i);
}
}
00000074 <fun6>:
74: 2300 movs r3, #0
76: 2200 movs r2, #0
78: 2100 movs r1, #0
7a: b082 sub sp, #8
7c: 9301 str r3, [sp, #4]
7e: 9b01 ldr r3, [sp, #4]
80: 3201 adds r2, #1
82: 9b01 ldr r3, [sp, #4]
84: 2a05 cmp r2, #5
86: d00d beq.n a4 <fun6+0x30>
88: 9101 str r1, [sp, #4]
8a: 9b01 ldr r3, [sp, #4]
8c: 4293 cmp r3, r2
8e: d2f7 bcs.n 80 <fun6+0xc>
90: 9b01 ldr r3, [sp, #4]
92: 3301 adds r3, #1
94: 9301 str r3, [sp, #4]
96: 9b01 ldr r3, [sp, #4]
98: 429a cmp r2, r3
9a: d8f9 bhi.n 90 <fun6+0x1c>
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
a2: d1f1 bne.n 88 <fun6+0x14>
a4: b002 add sp, #8
a6: 4770 bx lr
This one I found interesting; it could have been optimized better, and based on my gnu experience I am kind of confused. But as pointed out, this is how it is: you can expect one thing, but the compiler does what it does.
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
The i variable in the fun6 function is put on the stack for some reason; it is not volatile, so it does not demand that kind of access every time. But that is how they implemented it.
If I build with an older version of gcc I see this
9c: 3201 adds r2, #1
9e: 9b01 ldr r3, [sp, #4]
a0: 2a05 cmp r2, #5
Another thing to note is that gnu, at least, is not getting better with every version; at times it has been getting worse, and this is a simple case.
void fun7 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
fun2(i);
}
}
0000013c <fun7>:
13c: e12fff1e bx lr
Okay, too extreme (no surprise in the result); let us try this:
void more_fun ( unsigned int );
void fun8 ( void )
{
unsigned int i;
unsigned int len;
len = 5;
for(i=0; i < len; i++)
{
more_fun(i);
}
}
000000ac <fun8>:
ac: b510 push {r4, lr}
ae: 2000 movs r0, #0
b0: f7ff fffe bl 0 <more_fun>
b4: 2001 movs r0, #1
b6: f7ff fffe bl 0 <more_fun>
ba: 2002 movs r0, #2
bc: f7ff fffe bl 0 <more_fun>
c0: 2003 movs r0, #3
c2: f7ff fffe bl 0 <more_fun>
c6: 2004 movs r0, #4
c8: f7ff fffe bl 0 <more_fun>
cc: bd10 pop {r4, pc}
ce: 46c0 nop ; (mov r8, r8)
No surprise there: it chose to unroll the loop because 5 is below some threshold.
void fun9 ( unsigned int len )
{
unsigned int i;
for(i=0; i < len; i++)
{
more_fun(i);
}
}
000000d0 <fun9>:
d0: b570 push {r4, r5, r6, lr}
d2: 1e05 subs r5, r0, #0
d4: d006 beq.n e4 <fun9+0x14>
d6: 2400 movs r4, #0
d8: 0020 movs r0, r4
da: 3401 adds r4, #1
dc: f7ff fffe bl 0 <more_fun>
e0: 42a5 cmp r5, r4
e2: d1f9 bne.n d8 <fun9+0x8>
e4: bd70 pop {r4, r5, r6, pc}
That is what I was looking for. So in this case the i variable is in a register (r4), not on the stack as shown above. The calling convention says r4 and some number of registers after it (r5, r6, ...) must be preserved. This is calling an external function which the optimizer can't see into, so it has to implement the loop such that the function is called that many times, with each of the values, in order. Not dead code.
Textbook/classroom teaching implies that local variables are on the stack, but they do not have to be. i is not declared volatile, so instead the compiler takes a non-volatile register, r4, saves it on the stack so the caller does not lose its state, and uses r4 as i; the callee more_fun either will not touch r4 or will return it as it found it. You add a push, but save a bunch of loads and stores in the loop, yet another optimization based on the target and the ABI.
Volatile is a suggestion/recommendation/desire to the compiler that it keep an address for the variable and perform actual load and store accesses to that variable when it is used. It is ideal for use cases like a control/status register in a hardware peripheral, where you need all of the accesses described in the code to happen, in the order coded, with no optimization. As to a cache, that is independent of the language: you have to set up the cache and the mmu (or other solution) so that control and status registers do not get cached, and the peripheral is touched exactly when we wanted it to be touched. It takes both layers: you need to tell the compiler to perform all the accesses, and you need the memory system to not block those accesses.
Without volatile, and based on the command-line options you use and the list of optimizations the compiler has been programmed to attempt, the compiler will try to perform those optimizations as they are programmed into the compiler's code. If the compiler can't see into a called function, like more_fun above, because it is not in this optimization domain, then the compiler must functionally represent all the calls, in order. If it can see it and inlining is allowed, then the compiler can (if programmed to do so) essentially pull the function inline with the caller, THEN optimize that whole blob as if it were one function, based on other available options. It is not uncommon for the callee function to be bulky because of its nature, but when specific values are passed by a caller and the compiler can see all of it, the caller-plus-callee code can be smaller than the callee implementation alone.
You will often see folks wanting to for example learn assembly language by examining the output of a compiler do something like this:
void fun10 ( void )
{
int a;
int b;
int c;
a = 5;
b = 6;
c = a + b;
}
not realizing that this is dead code and should be optimized out if an optimizer is used. They ask a Stack Overflow question, and someone says you need to turn the optimizer off; now you get a lot of loads and stores, you have to understand and keep track of stack offsets, and while it is valid asm code you can study, it is not what you were hoping for. Instead, something like this is more valuable to that effort:
unsigned int fun11 ( unsigned int a, unsigned int b )
{
return(a+b);
}
The inputs are unknown to the compiler and a return value is required, so it can't dead-code this; it has to implement it.
And this is a simple case demonstrating that the caller plus callee can be smaller than the callee alone:
000000ec <fun11>:
ec: 1840 adds r0, r0, r1
ee: 4770 bx lr
000000f0 <fun12>:
f0: 2007 movs r0, #7
f2: 4770 bx lr
While that may not look simpler, it has inlined the code, optimized out the a = 3 and b = 4 assignments, optimized out the addition operation, and simply pre-computed the result and returned it.
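The source of fun12 is not shown above; based on the description (a = 3, b = 4, an inlined call to fun11, result 7), it was presumably something like the following reconstruction:

```c
unsigned int fun11 ( unsigned int a, unsigned int b )
{
    return(a+b);
}

/* presumed caller, reconstructed from the description above */
unsigned int fun12 ( void )
{
    unsigned int a;
    unsigned int b;
    a = 3;
    b = 4;
    return(fun11(a,b));   /* the compiler pre-computes this to "return 7" */
}
```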
Certainly with gcc you can cherry-pick the optimizations you want to add or block; there is a laundry list of them that you can go research.
With very little practice you can see what is optimizable, at least within the view of the function, and then hope the compiler figures it out. Certainly visualizing inlining takes more work, but really it is the same; you just visually inline it.
Now there are ways with gnu and llvm to optimize across files, basically the whole project, so more_fun would be visible now and the functions that call it might get optimized further than what you see in the object of the one file with the caller. It takes certain command lines on the compile and/or link for this to work, and I have not memorized them. With llvm there is a way to merge bytecode and then optimize that, but it does not always do what you hoped it would as far as a whole-project optimization.

Self written simple memset not working with -03 eabi gcc on ARMv7

I wrote a very simple memset in c that works fine up to -O2 but not with -O3...
memset:
void * memset(void * blk, int c, size_t n)
{
unsigned char * dst = blk;
while (n-- > 0)
*dst++ = (unsigned char)c;
return blk;
}
...which compiles to this assembly when using -O2:
20000430 <memset>:
20000430: e3520000 cmp r2, #0 # compare param 'n' with zero
20000434: 012fff1e bxeq lr # if equal return to caller
20000438: e6ef1071 uxtb r1, r1 # else zero extend (extract byte from) param 'c'
2000043c: e0802002 add r2, r0, r2 # add pointer 'blk' to 'n'
20000440: e1a03000 mov r3, r0 # move pointer 'blk' to r3
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
20000450: e12fff1e bx lr # else back to caller
This makes sense to me. I annotated what happens here.
When I compile it with -O3 the program crashes. My memset calls itself repeatedly until it ate the whole stack:
200005e4 <memset>:
200005e4: e3520000 cmp r2, #0 # compare param 'n' with zero
200005e8: e92d4010 push {r4, lr} # ? (1)
200005ec: e1a04000 mov r4, r0 # move pointer 'blk' to r4 (temp to hold return value)
200005f0: 0a000001 beq 200005fc <memset+0x18> # if equal (first line compare) jump to epilogue
200005f4: e6ef1071 uxtb r1, r1 # zero extend (extract byte from) param 'c'
200005f8: ebfffff9 bl 200005e4 <memset> # call myself ? (2)
200005fc: e1a00004 mov r0, r4 # epilogue start. move return value to r0
20000600: e8bd8010 pop {r4, pc} # restore r4 and back to caller
I can't figure out how this optimised version is supposed to work without any strb or similar. It doesn't matter whether I try to set the memory to '0' or something else, so the function is not only called on .bss (zero-initialised) variables.
(1) This is a problem. This push gets endlessly repeated without a matching pop, as it's called by (2) whenever the function doesn't early-exit because of 'n' being zero. I verified this with UART prints. Also, r2 is never touched, so why should the compare to zero ever become true?
Please help me understand what's happening here. Is the compiler assuming prerequisites that I may not fulfill?
Background: I'm using external code that requires memset in my baremetal project so I rolled my own. It's only used once on startup and not performance critical.
/edit: The compiler is called with these options:
arm-none-eabi-gcc -O3 -Wall -Wextra -fPIC -nostdlib -nostartfiles -marm -fstrict-volatile-bitfields -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv3
Your first question (1): that is per the calling convention. If you are going to make a nested function call, you need to preserve the link register, and the stack pointer needs to stay 64-bit aligned. The code uses r4, so that is the extra register saved. No magic there.
Your second question (2): it is not calling your memset, it is optimizing your code, because it recognises it as an (inefficient) memset. Fuz has provided the answers to your question.
Rename the function
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: e92d4010 push {r4, lr}
8: e1a04000 mov r4, r0
c: 0a000001 beq 18 <xmemset+0x18>
10: e6ef1071 uxtb r1, r1
14: ebfffffe bl 0 <memset>
18: e1a00004 mov r0, r4
1c: e8bd8010 pop {r4, pc}
and you can see this.
If you were to use -ffreestanding as Fuz recommended then you see this or something like it
00000000 <xmemset>:
0: e3520000 cmp r2, #0
4: 012fff1e bxeq lr
8: e92d41f0 push {r4, r5, r6, r7, r8, lr}
c: e2426001 sub r6, r2, #1
10: e3560002 cmp r6, #2
14: e6efe071 uxtb lr, r1
18: 9a00002a bls c8 <xmemset+0xc8>
1c: e3a0c000 mov r12, #0
20: e3520023 cmp r2, #35 ; 0x23
24: e7c7c01e bfi r12, lr, #0, #8
28: e1a04122 lsr r4, r2, #2
2c: e7cfc41e bfi r12, lr, #8, #8
30: e7d7c81e bfi r12, lr, #16, #8
34: e7dfcc1e bfi r12, lr, #24, #8
38: 9a000024 bls d0 <xmemset+0xd0>
3c: e2445009 sub r5, r4, #9
40: e1a03000 mov r3, r0
44: e3c55007 bic r5, r5, #7
48: e3a07000 mov r7, #0
4c: e2851008 add r1, r5, #8
50: e1570005 cmp r7, r5
54: f5d3f0a0 pld [r3, #160] ; 0xa0
58: e1a08007 mov r8, r7
5c: e583c000 str r12, [r3]
60: e583c004 str r12, [r3, #4]
64: e2877008 add r7, r7, #8
68: e583c008 str r12, [r3, #8]
6c: e2833020 add r3, r3, #32
70: e503c014 str r12, [r3, #-20] ; 0xffffffec
74: e503c010 str r12, [r3, #-16]
78: e503c00c str r12, [r3, #-12]
7c: e503c008 str r12, [r3, #-8]
80: e503c004 str r12, [r3, #-4]
84: 1afffff1 bne 50 <xmemset+0x50>
88: e2811001 add r1, r1, #1
8c: e483c004 str r12, [r3], #4
90: e1540001 cmp r4, r1
94: 8afffffb bhi 88 <xmemset+0x88>
98: e3c23003 bic r3, r2, #3
9c: e1520003 cmp r2, r3
a0: e0466003 sub r6, r6, r3
a4: e0803003 add r3, r0, r3
a8: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
ac: e3560000 cmp r6, #0
b0: e5c3e000 strb lr, [r3]
b4: 08bd81f0 popeq {r4, r5, r6, r7, r8, pc}
b8: e3560001 cmp r6, #1
bc: e5c3e001 strb lr, [r3, #1]
c0: 15c3e002 strbne lr, [r3, #2]
c4: e8bd81f0 pop {r4, r5, r6, r7, r8, pc}
c8: e1a03000 mov r3, r0
cc: eafffff6 b ac <xmemset+0xac>
d0: e1a03000 mov r3, r0
d4: e3a01000 mov r1, #0
d8: eaffffea b 88 <xmemset+0x88>
which appears to show that it simply inlined memset, the one the compiler knows (the faster one), not your code.
So if you want it to use your code, stick with -O2. Yours is pretty inefficient anyway, so it is not clear why you would need to push it any further than it was.
20000444: e4c31001 strb r1, [r3], #1 # store value of 'c' to address of r3, increment r3 for next pass
20000448: e1530002 cmp r3, r2 # compare current store address to calculated max address
2000044c: 1afffffc bne 20000444 <memset+0x14> # if not equal store next byte
It isn't going to get any better than that without replacing your code with something else.
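For illustration, the "something else" generally means aligning the destination and then filling a word at a time, as the library memset in the -ffreestanding listing above does. A minimal, untuned sketch (note that gcc may again recognise and replace this pattern unless built freestanding):

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: byte-fill until aligned, then store 4 bytes at a time. */
void *xmemset(void *blk, int c, size_t n)
{
    unsigned char *dst = blk;
    unsigned char b = (unsigned char)c;

    /* head: byte stores until dst is 4-byte aligned */
    while (n && ((uintptr_t)dst & 3u)) {
        *dst++ = b;
        n--;
    }

    /* body: replicate the byte across a word, store 4 bytes per pass */
    uint32_t word = b * 0x01010101u;
    while (n >= 4) {
        *(uint32_t *)dst = word;
        dst += 4;
        n -= 4;
    }

    /* tail: remaining bytes */
    while (n--)
        *dst++ = b;

    return blk;
}
```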
Fuz already answered the question:
Compile with -fno-builtin-memset. The compiler recognises that the function implements memset and thus replaces it with a call to memset. You should in general compile with -ffreestanding when writing bare-metal code. I believe this fixes this sort of problem, too
It is replacing your code with a call to memset; if you want it not to do that, use -ffreestanding.
If you wish to go beyond that and wonder why -fno-builtin-memset didn't work, that is a question for the gcc folks; file a ticket and let us know what they say (or just look at the compiler source code).

How does runtime memory allocation for local variable's work in embedded systems?

We are using ccrx compiler, embOS RTOS
There is a function in the code,
void fun( )
{
if(condition)
{ int a;}
else if(condition1)
{int b;}............
else
{ int z;}
}
Whenever the function gets called, the related thread's stack overflows. If a few of the int variable declarations are commented out, the thread stack does not overflow.
How is the stack allocated? Doesn't the memory get allocated only after a condition succeeds?
Let's take GCC for example:
void fun( int condition, int condition1 )
{
if(condition)
{ int a; a=5;}
else if(condition1)
{int b; b=7;}
else
{ int z; z=9; }
}
and pick a target; I am not going to pay, or whatever it takes, to get ccrx...
00000000 <fun>:
0: e52db004 push {r11} ; (str r11, [sp, #-4]!)
4: e28db000 add r11, sp, #0
8: e24dd01c sub sp, sp, #28
c: e50b0018 str r0, [r11, #-24] ; 0xffffffe8
10: e50b101c str r1, [r11, #-28] ; 0xffffffe4
14: e51b3018 ldr r3, [r11, #-24] ; 0xffffffe8
18: e3530000 cmp r3, #0
1c: 0a000002 beq 2c <fun+0x2c>
20: e3a03005 mov r3, #5
24: e50b3008 str r3, [r11, #-8]
28: ea000007 b 4c <fun+0x4c>
2c: e51b301c ldr r3, [r11, #-28] ; 0xffffffe4
30: e3530000 cmp r3, #0
34: 0a000002 beq 44 <fun+0x44>
38: e3a03007 mov r3, #7
3c: e50b300c str r3, [r11, #-12]
40: ea000001 b 4c <fun+0x4c>
44: e3a03009 mov r3, #9
48: e50b3010 str r3, [r11, #-16]
4c: e1a00000 nop ; (mov r0, r0)
50: e28bd000 add sp, r11, #0
54: e49db004 pop {r11} ; (ldr r11, [sp], #4)
58: e12fff1e bx lr
Without the assignments:
void fun( int condition, int condition1 )
{
if(condition)
{ int a;/* a=5;*/}
else if(condition1)
{int b;/* b=7;*/}
else
{ int z; /*z=9;*/ }
}
Even with no optimization these variables are dead code and are optimized out:
00000000 <fun>:
0: e52db004 push {r11} ; (str r11, [sp, #-4]!)
4: e28db000 add r11, sp, #0
8: e24dd00c sub sp, sp, #12
c: e50b0008 str r0, [r11, #-8]
10: e50b100c str r1, [r11, #-12]
14: e1a00000 nop ; (mov r0, r0)
18: e28bd000 add sp, r11, #0
1c: e49db004 pop {r11} ; (ldr r11, [sp], #4)
20: e12fff1e bx lr
There are some bytes on the stack for alignment; they could have shaved some off and stayed aligned, but that is another topic.
The point here is that just because, in the high-level language, your variables are only used in a portion of the function doesn't mean the compiler has to do it that way. Compilers, certainly gcc, tend to do all of their stack allocation at the beginning of the function and clean up at the end, as was done here.
This is not unlike
int fun( void )
{
static int x;
x++;
if(x>10) return(1);
if(fun()) return(1);
return(0);
}
which gives
00000000 <fun>:
0: e59f2030 ldr r2, [pc, #48] ; 38 <fun+0x38>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e353000a cmp r3, #10
10: e5823000 str r3, [r2]
14: da000001 ble 20 <fun+0x20>
18: e3a00001 mov r0, #1
1c: e12fff1e bx lr
20: e92d4010 push {r4, lr}
24: ebfffffe bl 0 <fun>
28: e2900000 adds r0, r0, #0
2c: 13a00001 movne r0, #1
30: e8bd4010 pop {r4, lr}
34: e12fff1e bx lr
38: 00000000 andeq r0, r0, r0
Disassembly of section .bss:
00000000 <x.4089>:
0: 00000000 andeq r0, r0, r0
It is a local variable, but by being static it goes into the global pool, not allocated on the stack like other local variables (or optimized into registers).
Although it is interesting that this was a case where it didn't allocate on the stack right away. That is good here: you don't want to burden the stack with recursion if you don't have to. Nice optimization.
There is no reason to assume the stack pointer will change multiple times throughout the function because of what you did in the high-level language. My guess is that doing it all in one shot up front makes developing the compiler easier, even though it is wasteful on memory; on the other hand, allocating as you go would cost more instructions (space and time).

arm-none-eabi-gcc C Pointers

So, working with C in arm-none-eabi-gcc, I have been having an issue with pointers; they don't seem to exist. Perhaps I'm passing the wrong options to the compiler.
Here is an example.
unsigned int * gpuPointer = GetGPU_Pointer(framebufferAddress);
unsigned int color = 16;
int y = 768;
int x = 1024;
while(y >= 0)
{
while(x >= 0)
{
*gpuPointer = color;
color = color + 2;
x--;
}
color++;
y--;
x = 1024;
}
and the output from the disassembler.
81c8: ebffffc3 bl 80dc <GetGPU_Pointer>
81cc: e3a0c010 mov ip, #16 ; 0x10
81d0: e28c3b02 add r3, ip, #2048 ; 0x800
81d4: e2833002 add r3, r3, #2 ; 0x2
81d8: e1a03803 lsl r3, r3, #16
81dc: e1a01823 lsr r1, r3, #16
81e0: e1a0300c mov r3, ip
81e4: e1a02003 mov r2, r3
81e8: e2833002 add r3, r3, #2 ; 0x2
81ec: e1a03803 lsl r3, r3, #16
81f0: e1a03823 lsr r3, r3, #16
81f4: e1530001 cmp r3, r1
81f8: 1afffff9 bne 81e4 <setup_framebuffer+0x5c>
Shouldn't there be a str instruction around 81e4? To add further context, GetGPU_Pointer is coming from an assembler file, but there is a declaration, as so:
extern unsigned int * GetGPU_Pointer(unsigned int framebufferAddress);
My gut feeling is its something absurdly simple but I'm missing it.
You never change the value of gpuPointer and you haven't declared it to point to a volatile. So from the compiler's perspective you are overwriting a single memory location (*gpuPointer) 768*1024 times, but since you never use the value you are writing into it, the compiler is entitled to optimize by doing a single write at the end of the loop.
Adding to rici's answer (upvote rici not me)...
It gets even better. Taking what you offered and wrapping it:
extern unsigned int * GetGPU_Pointer ( unsigned int );
void fun ( unsigned int framebufferAddress )
{
    unsigned int * gpuPointer = GetGPU_Pointer(framebufferAddress);
    unsigned int color = 16;
    int y = 768;
    int x = 1024;
    while(y >= 0)
    {
        while(x >= 0)
        {
            *gpuPointer = color;
            color = color + 2;
            x--;
        }
        color++;
        y--;
        x = 1024;
    }
}
Optimizes to
00000000 <fun>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <GetGPU_Pointer>
8: e59f3008 ldr r3, [pc, #8] ; 18 <fun+0x18>
c: e5803000 str r3, [r0]
10: e8bd4008 pop {r3, lr}
14: e12fff1e bx lr
18: 00181110 andseq r1, r8, r0, lsl r1
because the code really doesn't do anything but that one store.
Now if you were to modify the pointer
while(x >= 0)
{
    *gpuPointer = color;
    gpuPointer++;
    color = color + 2;
    x--;
}
then you get the store you were looking for
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <GetGPU_Pointer>
8: e59f403c ldr r4, [pc, #60] ; 4c <fun+0x4c>
c: e1a02000 mov r2, r0
10: e3a0c010 mov ip, #16
14: e2820a01 add r0, r2, #4096 ; 0x1000
18: e2801004 add r1, r0, #4
1c: e1a0300c mov r3, ip
20: e4823004 str r3, [r2], #4
24: e1520001 cmp r2, r1
28: e2833002 add r3, r3, #2
2c: 1afffffb bne 20 <fun+0x20>
30: e28ccb02 add ip, ip, #2048 ; 0x800
34: e28cc003 add ip, ip, #3
38: e15c0004 cmp ip, r4
3c: e2802004 add r2, r0, #4
40: 1afffff3 bne 14 <fun+0x14>
44: e8bd4010 pop {r4, lr}
48: e12fff1e bx lr
4c: 00181113 andseq r1, r8, r3, lsl r1
or if you make it volatile (and then don't have to modify it)
volatile unsigned int * gpuPointer = GetGPU_Pointer(framebufferAddress);
then
00000000 <fun>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <GetGPU_Pointer>
8: e59fc02c ldr ip, [pc, #44] ; 3c <fun+0x3c>
c: e3a03010 mov r3, #16
10: e2831b02 add r1, r3, #2048 ; 0x800
14: e2812002 add r2, r1, #2
18: e5803000 str r3, [r0]
1c: e2833002 add r3, r3, #2
20: e1530002 cmp r3, r2
24: 1afffffb bne 18 <fun+0x18>
28: e2813003 add r3, r1, #3
2c: e153000c cmp r3, ip
30: 1afffff6 bne 10 <fun+0x10>
34: e8bd4008 pop {r3, lr}
38: e12fff1e bx lr
3c: 00181113 andseq r1, r8, r3, lsl r1
then you get your store
arm-none-eabi-gcc -O2 -c a.c -o a.o
arm-none-eabi-objdump -D a.o
arm-none-eabi-gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The problem is, as written, you didn't tell the compiler to update the pointer more than the one time. So, as in my first example, it has no reason to even implement the loop; it can pre-compute the answer and write it one time. To force the compiler to implement the loop and write through the pointer more than once, you need to make it volatile and/or modify it, depending on what you actually need to do.

Local variable location in memory

For a homework assignment I have been given some C files, and compiled them using arm-linux-gcc (we will eventually be targeting Gumstix boards, but for these exercises we have been working with qemu and ema).
One of the questions confuses me a bit-- we are told to:
Use arm-linux-objdump to find the location of variables declared in main() in the executable binary.
However, these variables are local and thus shouldn't have addresses until runtime, correct?
I'm thinking that maybe what I need to find is the offset in the stack frame, which can in fact be found using objdump (not that I know how).
Anyways, any insight into the matter would be greatly appreciated, and I would be happy to post the source code if necessary.
unsigned int one ( unsigned int, unsigned int );
unsigned int two ( unsigned int, unsigned int );
unsigned int myfun ( unsigned int x, unsigned int y, unsigned int z )
{
    unsigned int a,b;
    a=one(x,y);
    b=two(a,z);
    return(a+b);
}
compile and disassemble
arm-none-eabi-gcc -c fun.c -o fun.o
arm-none-eabi-objdump -D fun.o
code created by compiler
00000000 <myfun>:
0: e92d4800 push {fp, lr}
4: e28db004 add fp, sp, #4
8: e24dd018 sub sp, sp, #24
c: e50b0010 str r0, [fp, #-16]
10: e50b1014 str r1, [fp, #-20]
14: e50b2018 str r2, [fp, #-24]
18: e51b0010 ldr r0, [fp, #-16]
1c: e51b1014 ldr r1, [fp, #-20]
20: ebfffffe bl 0 <one>
24: e50b0008 str r0, [fp, #-8]
28: e51b0008 ldr r0, [fp, #-8]
2c: e51b1018 ldr r1, [fp, #-24]
30: ebfffffe bl 0 <two>
34: e50b000c str r0, [fp, #-12]
38: e51b2008 ldr r2, [fp, #-8]
3c: e51b300c ldr r3, [fp, #-12]
40: e0823003 add r3, r2, r3
44: e1a00003 mov r0, r3
48: e24bd004 sub sp, fp, #4
4c: e8bd4800 pop {fp, lr}
50: e12fff1e bx lr
Short answer: the memory is "allocated" both at compile time and at run time. Compile time in the sense that the compiler determines the size of the stack frame, and who goes where in it, at compile time. Run time in the sense that the memory itself is on the stack, which is a dynamic thing; the stack frame is taken from stack memory at run time, almost like a malloc() and free().
It helps to know the calling convention: x enters in r0, y in r1, z in r2. Then x has its home at fp-16, y at fp-20, and z at fp-24. The call to one() needs x and y, so it pulls those from the stack. The result of one() goes into a, which is saved at fp-8, so that is the home for a. And so on.
The function one is not really at address 0; this is a disassembly of an object file, not a linked binary. Once the object is linked with the rest of the objects and libraries, the missing parts, like the addresses of external functions, are patched in by the linker, and the calls to one() and two() get real addresses. (And the program will likely not start at address 0.)
I cheated here a little: I knew that, with no optimizations enabled and a relatively simple function like this, there really is no reason for a stack frame.
compile with just a little optimization
arm-none-eabi-gcc -O1 -c fun.c -o fun.o
arm-none-eabi-objdump -D fun.o
and the stack frame is gone, the local variables remain in registers.
00000000 <myfun>:
0: e92d4038 push {r3, r4, r5, lr}
4: e1a05002 mov r5, r2
8: ebfffffe bl 0 <one>
c: e1a04000 mov r4, r0
10: e1a01005 mov r1, r5
14: ebfffffe bl 0 <two>
18: e0800004 add r0, r0, r4
1c: e8bd4038 pop {r3, r4, r5, lr}
20: e12fff1e bx lr
What the compiler decided to do instead is give itself more registers to work with by saving them on the stack. Why it saved r3 is a mystery, but that is another topic...
Entering the function, r0 = x, r1 = y, and r2 = z per the calling convention. We can leave r0 and r1 alone (try again with one(y,x) and see what happens) since they drop right into one() and are never needed again. The calling convention says that a function may destroy r0-r3, so we need to preserve z for later; it is saved in r5. The result of one() comes back in r0 per the calling convention. Since two() can also destroy r0-r3, we save a for later in r4 (we need r0 for the call to two() anyway). We need the result of one() as the first parameter to two(), and it is already there; we need z as the second, so we move r5, where z was saved, to r1, then call two(). The result of two() comes back in r0, again per the convention. Since b + a = a + b from basic math properties, the final add before returning is r0 + r4, which is b + a, and the result goes in r0, the register used to return something from a function, per the convention. Clean up the stack, restore the modified registers, done.
Since myfun() calls other functions using bl, and bl modifies the link register (r14), we need the value in the link register preserved from entry into myfun() until the final return (bx lr), so lr is pushed on the stack. The convention states that we can destroy r0-r3 in our function but not the other registers, so r4 and r5 are pushed on the stack because we used them. Pushing r3 is not necessary from a calling convention perspective; I wonder if it was done in anticipation of a 64-bit memory system, where two full 64-bit writes are cheaper than one 64-bit write plus one 32-bit write, but you would need to know the alignment of the stack going in, so that is just a theory. There is no reason to preserve r3's contents in this code.
Now take this knowledge and disassemble the code you were assigned (arm-...-objdump -D something.something) and do the same kind of analysis, particularly for functions named main() versus functions not named main() (I did not use main() on purpose); the stack frame can be a size that doesn't make sense, or makes less sense than other functions. In the non-optimized case above we needed to store 6 things total: x, y, z, a, b, and the link register. 6*4 = 24 bytes, which resulted in sub sp, sp, #24; I need to think about the stack pointer vs frame pointer accounting a bit. There is also a command line argument to tell the compiler not to use a frame pointer, -fomit-frame-pointer, and it saves a couple of instructions:
00000000 <myfun>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e24dd01c sub sp, sp, #28
8: e58d000c str r0, [sp, #12]
c: e58d1008 str r1, [sp, #8]
10: e58d2004 str r2, [sp, #4]
14: e59d000c ldr r0, [sp, #12]
18: e59d1008 ldr r1, [sp, #8]
1c: ebfffffe bl 0 <one>
20: e58d0014 str r0, [sp, #20]
24: e59d0014 ldr r0, [sp, #20]
28: e59d1004 ldr r1, [sp, #4]
2c: ebfffffe bl 0 <two>
30: e58d0010 str r0, [sp, #16]
34: e59d2014 ldr r2, [sp, #20]
38: e59d3010 ldr r3, [sp, #16]
3c: e0823003 add r3, r2, r3
40: e1a00003 mov r0, r3
44: e28dd01c add sp, sp, #28
48: e49de004 pop {lr} ; (ldr lr, [sp], #4)
4c: e12fff1e bx lr
optimizing saves a whole lot more though...
It's going to depend on the program and how exactly they want the location of the variables. Does the question want to know which section they're stored in (.const, .bss, etc.)? Does it want specific addresses? Either way, a good start is objdump's -S flag:
objdump -S myprogram > dump.txt
This is nice because it will print out an intermixing of your source code and the assembly with addresses. From here just do a search for your int main and that should get you started.

Resources