GCC computed goto and value of stack pointer - c

In GCC you can use a computed goto by taking the address of a label (as in void *addr = &&label) and then jumping to it (goto *addr). The GCC manual says you can jump to this address from anywhere in the function; only jumping to it from another function is undefined.
When you jump to the code it cannot assume anything about the values of registers, so presumably it reloads them from memory. However the value of the stack pointer is also not necessarily defined, for example you could be jumping from a nested scope which declares extra variables.
The question is: how does GCC manage to set the value of the stack pointer to the correct value (it may be too high or too low)? And how does this interact with -fomit-frame-pointer (if it does)?
Finally, for extra points, what are the real constraints about where you can jump to a label from? For example, you could probably do it from an interrupt handler.

In general, when you have a function with labels whose address is taken, gcc needs to ensure that you can jump to that label from any indirect goto in the function -- so it needs to lay out the stack so that either the exact stack pointer doesn't matter (everything is indexed off the frame pointer) or the stack pointer is consistent across all of them. Generally, this means it allocates a fixed amount of stack space when the function starts and never touches the stack pointer afterwards. So if you have inner scopes with variables, the space will be allocated at function start and freed at function end, not in the inner scope. Only the constructor and destructor (if any) need to be tied to the inner scope.
The only constraint on jumping to labels is the one you noted -- you can only do it from within the function that contains the labels. Not from any other stack frame of any other function or interrupt handler or anything.
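To make the "any indirect goto in the function" case concrete, here is a minimal sketch of a dispatch loop built on labels as values. It assumes GCC or clang (the && operator on labels is an extension), and the opcode names and run_program are made up for illustration. Note the inner-scope local in one of the targets: per the explanation above, its space is expected to be part of the fixed frame, so jumping to any label from any of the gotos leaves the stack pointer consistent.

```c
/* Tiny bytecode interpreter using GCC's labels-as-values extension.
   Opcode names and run_program are hypothetical, for illustration only. */
enum { OP_INC, OP_DBL, OP_HALT };

int run_program(const unsigned char *code)
{
    /* Table of label addresses; only meaningful inside this function. */
    static void *const dispatch[] = { &&do_inc, &&do_dbl, &&do_halt };
    int acc = 0;

    goto *dispatch[*code++];

do_inc:
    {
        int step = 1;   /* inner-scope local; no stack adjustment expected here */
        acc += step;
    }
    goto *dispatch[*code++];

do_dbl:
    acc *= 2;
    goto *dispatch[*code++];

do_halt:
    return acc;
}
```

Each goto can reach every label, which is exactly the situation where the compiler cannot tie stack adjustments to individual scopes.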
If you want to be able to jump from one stack frame to another, you need to use setjmp/longjmp or something similar to unwind the stack. You could combine that with an indirect goto -- something like:
if (target = (void *)setjmp(jmpbuf)) goto *target;
That way you could call longjmp(jmpbuf, label_address); from any called function to unwind the stack and then jump to the label. As long as setjmp/longjmp works from an interrupt handler, this will also work from an interrupt handler. Note that this depends on a label address fitting in an int (sizeof(int) == sizeof(void *)), which is not always the case.
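One way to sidestep the int-versus-pointer size problem is to carry the label address in a variable instead of in longjmp's value. The following is only a sketch under that assumption, for GCC or clang; unwind_point, pending_target, bail_out, worker and guarded_call are hypothetical names, not part of any library.

```c
#include <setjmp.h>

static jmp_buf unwind_point;
static void *pending_target;   /* label address carried across the longjmp */

/* Hypothetical helper: record where to resume, then unwind the stack. */
static void bail_out(void *label)
{
    pending_target = label;
    longjmp(unwind_point, 1);
}

/* Hypothetical callee that may abandon its work. */
static void worker(int fail, void *on_error)
{
    if (fail)
        bail_out(on_error);
}

int guarded_call(int fail)
{
    /* After longjmp we are back in this function's frame, so the
       computed goto is jumping within its own function. */
    if (setjmp(unwind_point) != 0)
        goto *pending_target;

    worker(fail, &&failed);
    return 0;

failed:
    return -1;
}
```

The label address is only ever used by a goto inside guarded_call, which keeps the jump within the constraint described above.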

I don't think the fact that the gotos are computed adds anything to the effect they have on local variables. The lifetime of a local variable starts on entry to the block in which it is declared and ends only when the scope of the variable cannot be reached in any way. This covers all sorts of control flow, in particular goto and longjmp. So all such variables are always safe until the return from the function in which they are declared.
Labels in C are visible to the whole enclosing function, so it does not make much difference that this is a computed goto. You could always replace a computed goto with a more or less involved switch statement.
One notable exception to this rule on local variables is variable length arrays (VLAs). Since they necessarily change the stack pointer, they have different rules. Their lifetime ends as soon as you leave their block of declaration, and goto and longjmp are not allowed into scopes after a declaration of a variably modified type.
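The remark that a computed goto can always be rewritten as a switch can be sketched as follows. This is an illustrative example in standard C; the opcode names and run_with_switch are invented here.

```c
/* Portable switch-based dispatch; the opcode names are hypothetical. */
enum { SW_INC, SW_DBL, SW_HALT };

int run_with_switch(const unsigned char *code)
{
    int acc = 0;

    for (;;) {
        switch (*code++) {
        case SW_INC:
            acc += 1;
            break;
        case SW_DBL:
            acc *= 2;
            break;
        case SW_HALT:
            return acc;
        default:
            return -1;   /* unknown opcode */
        }
    }
}
```

Every case label plays the role of a label whose address was taken, and the switch expression plays the role of the pointer passed to goto.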

In the function prologue the current position of the stack is saved in a callee-saved register, even with -fomit-frame-pointer.
In the example below, sp+4 is stored in r7 and then restored in the epilogue (LBB0_3) (r7-4 -> r4; r4 -> sp). Because of this you can jump anywhere within the function, or grow the stack at any point in the function, without corrupting the stack. If you jump out of the function (via goto *addr) you will skip this epilogue and royally screw up the stack.
Short example which also uses alloca which dynamically allocates memory on the stack:
clang -arch armv7 -fomit-frame-pointer -c -S -O0 -o - stack.c
#include <alloca.h>
int foo(int sz, int jmp) {
char *buf = alloca(sz);
int rval = 0;
if( jmp ) {
rval = 1;
goto done;
}
volatile int s = 2;
rval = s * 5;
done:
return rval;
}
and disassembly:
_foo:
# BB#0:
push {r4, r7, lr}
add r7, sp, #4
sub sp, #20
movs r2, #0
movt r2, #0
str r0, [r7, #-8]
str r1, [r7, #-12]
ldr r0, [r7, #-8]
adds r0, #3
bic r0, r0, #3
mov r1, sp
subs r0, r1, r0
mov sp, r0
str r0, [r7, #-16]
str r2, [r7, #-20]
ldr r0, [r7, #-12]
cmp r0, #0
beq LBB0_2
# BB#1:
movs r0, #1
movt r0, #0
str r0, [r7, #-20]
b LBB0_3
LBB0_2:
movs r0, #2
movt r0, #0
str r0, [r7, #-24]
ldr r0, [r7, #-24]
movs r1, #5
movt r1, #0
muls r0, r1, r0
str r0, [r7, #-20]
LBB0_3:
ldr r0, [r7, #-20]
subs r4, r7, #4
mov sp, r4
pop {r4, r7, pc}

Related

ARM Thumb GCC Disassembled C. Caller-saved registers not saved and loading and storing same register immediately

Context: STM32F469 Cortex-M4 (ARMv7-M Thumb-2), Win 10, GCC, STM32CubeIDE; Learning/Trying out inline assembly & reading disassembly, stack managements etc., writing to core registers, observing contents of registers, examining RAM around stack pointer to understand how things work.
I've noticed that when I call a function that receives an argument, the instructions generated at the beginning of the called function do "store R3 at RAM address X" followed immediately by "read RAM address X and store in R3". So it's writing and reading the same value back; R3 is not changed. If it only wanted to save the value of R3 onto the stack, why load it back?
C code, caller function (main), my code:
asm volatile(" LDR R0,=#0x00000000\n"
" LDR R1,=#0x11111111\n"
" LDR R2,=#0x22222222\n"
" LDR R3,=#0x33333333\n"
" LDR R4,=#0x44444444\n"
" LDR R5,=#0x55555555\n"
" LDR R6,=#0x66666666\n"
" MOV R7,R7\n" //Stack pointer value is here, used for stack data access
" LDR R8,=#0x88888888\n"
" LDR R9,=#0x99999999\n"
" LDR R10,=#0xAAAAAAAA\n"
" LDR R11,=#0xBBBBBBBB\n"
" LDR R12,=#0xCCCCCCCC\n"
);
testInt = addFifteen(testInt); //testInt=0x03; returns uint8_t, argument uint8_t
Function call generates instructions to load function argument into R3, then move it to R0, then branch with link to addFifteen. So by the time I enter addFifteen, R0 and R3 have value 0x03 (testInt). So far so good. Here is what function call looks like:
testInt = addFifteen(testInt);
08000272: ldrb r3, [r7, #11]
08000274: mov r0, r3
08000276: bl 0x80001f0 <addFifteen>
So I go into addFifteen, my C code for addFifteen:
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
and its disassembly:
addFifteen:
080001f0: push {r7}
080001f2: sub sp, #12
080001f4: add r7, sp, #0
080001f6: mov r3, r0
080001f8: strb r3, [r7, #7]
080001fa: ldrb r3, [r7, #7]
080001fc: adds r3, #15
080001fe: uxtb r3, r3
08000200: mov r0, r3
08000202: adds r7, #12
08000204: mov sp, r7
08000206: ldr.w r7, [sp], #4
0800020a: bx lr
My primary interest is in the 1f8 and 1fa lines. It stores R3 on the stack and then loads the freshly written value back into the register that still holds the value anyway.
Questions are:
1. What is the purpose of this "store register A into RAM X, next read value from RAM X into register A"? The read instruction doesn't seem to serve any purpose. Is it to make sure the RAM write is complete?
2. The push {r7} instruction makes the stack 4-byte aligned instead of 8-byte aligned. But immediately after that instruction SP is decremented by 12 (bytes), so it becomes 8-byte aligned again. Therefore, this behavior is OK. Is this statement correct? What if an interrupt happens between these two instructions? Will alignment be fixed during ISR stacking for the duration of the ISR?
3. From what I read about caller/callee-saved registers (it is very hard to find any sort of well-organized information on that; if you have good material, please share a link), at least R0-R3 must be placed on the stack when I call a function. However, it's easy to notice in this case that NONE of the registers were pushed on the stack. I verified it by checking memory around the stack pointer: it would have been easy to notice 0x11111111 and 0x22222222, but they aren't there, and nothing is pushing them there. The values in R0 and R3 that I had before I called the function are simply gone forever. Why weren't any registers pushed on the stack before the function call? I would expect R3 to hold 0x33333333 when addFifteen returns, because that's how it was before the function call, but that value is casually overwritten even before the branch to addFifteen. Why didn't GCC generate instructions to push R0-R3 onto the stack and only after that branch with link to addFifteen?
If you need some compiler settings, please, let me know where to find them in Eclipse (STM32CubeIDE) and what exactly you need there, I will happily provide them and add them to the question here.
uint8_t addFifteen(uint8_t input){
return (input + 15U);
}
What you are looking at here is unoptimized and at least with gnu the input and local variables get a memory location on the stack.
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
What you see with r3 is that the input variable, input, comes in r0. For some reason (the code is not optimized) it goes into r3, and then it is saved in its memory location on the stack.
Setup the stack
00000000 <addFifteen>:
0: b480 push {r7}
2: b083 sub sp, #12
4: af00 add r7, sp, #0
save input to the stack
6: 4603 mov r3, r0
8: 71fb strb r3, [r7, #7]
so now we can start implementing the code in the function, which wants to do math on the input variable, so do that math
a: 79fb ldrb r3, [r7, #7]
c: 330f adds r3, #15
Convert the result to an unsigned char.
e: b2db uxtb r3, r3
Now prepare the return value
10: 4618 mov r0, r3
and clean up and return
12: 370c adds r7, #12
14: 46bd mov sp, r7
16: bc80 pop {r7}
18: 4770 bx lr
Now if I tell it not to use a frame pointer (just a waste of a register).
00000000 <addFifteen>:
0: b082 sub sp, #8
2: 4603 mov r3, r0
4: f88d 3007 strb.w r3, [sp, #7]
8: f89d 3007 ldrb.w r3, [sp, #7]
c: 330f adds r3, #15
e: b2db uxtb r3, r3
10: 4618 mov r0, r3
12: b002 add sp, #8
14: 4770 bx lr
And you can still see each of the fundamental steps in implementing the function. Unoptimized.
Now if you optimize
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
It removes all the excess.
number two.
Yes I agree this looks wrong, but gnu certainly does not keep the stack on an alignment at all times, so this looks wrong. But I have not read the details on the arm calling convention. Nor have I read to see what gcc's interpretation is. Granted they may claim a spec, but at the end of the day the compiler authors choose the calling convention for their compiler, they are under no obligation to arm or intel or others to conform to any spec. Their choice, and like the C language itself, there are lots of places where it is implementation defined and gnu implements the C language one way and others another way. Perhaps this is the same. Same goes for this saving of the incoming variable to the stack. We will see that llvm/clang does not.
number three.
r0-r3 and another register or two may be called caller saved, but the better way to think of them is volatile. The callee is free to modify them without saving them. It is not so much a case of saving the r0 register, but instead r0 represents a variable and you are managing that variable in functionally implementing the high level code.
For example
unsigned int fun1 ( void );
unsigned int fun0 ( unsigned int x )
{
return(fun1()+x);
}
00000000 <fun0>:
0: b510 push {r4, lr}
2: 4604 mov r4, r0
4: f7ff fffe bl 0 <fun1>
8: 4420 add r0, r4
a: bd10 pop {r4, pc}
x comes in in r0, and we need to preserve that value until after fun1() is called. r0 can be destroyed/modified by fun1(). So in this case they save r4, not r0, and keep x in r4.
clang does this as well
00000000 <fun0>:
0: b5d0 push {r4, r6, r7, lr}
2: af02 add r7, sp, #8
4: 4604 mov r4, r0
6: f7ff fffe bl 0 <fun1>
a: 1900 adds r0, r0, r4
c: bdd0 pop {r4, r6, r7, pc}
Back to your function.
clang, unoptimized also keeps the input variable in memory (stack).
00000000 <addFifteen>:
0: b081 sub sp, #4
2: f88d 0003 strb.w r0, [sp, #3]
6: f89d 0003 ldrb.w r0, [sp, #3]
a: 300f adds r0, #15
c: b2c0 uxtb r0, r0
e: b001 add sp, #4
10: 4770 bx lr
and you can see the same steps, prep the stack, store the input variable. Take the input variable do the math. Prepare the return value. Clean up, return.
Clang/llvm optimized:
00000000 <addFifteen>:
0: 300f adds r0, #15
2: b2c0 uxtb r0, r0
4: 4770 bx lr
Happens to be the same as gnu. Not expected that any two different compilers generate the same code, nor any expectation that any two versions of the same compiler generate the same code.
unoptimized, the input and local variables (none in this case) get a home on the stack. So what you are seeing is the input variable being put in its home on the stack as part of the setup of the function. Then the function itself wants to operate on that variable so, unoptimized, it needs to fetch that value from memory to create an intermediate variable (that in this case did not get a home on the stack) and so on. You see this with volatile variables as well. They will get written to memory then read back then modified then written to memory and read back, etc...
yes I agree, but I have not read the specs. End of the day it is gcc's calling convention or interpretation of some spec they choose to use. They have been doing this (not being aligned 100% of the time) for a long time and it does not fail. For all called functions they are aligned when the functions are called. Interrupts in arm code generated by gcc is not aligned all the time. Been this way since they adopted that spec.
by definition r0-r3, etc are volatile. The callee can modify them at will. The callee only needs to save/preserve them if IT needs them. In both the unoptimized and optimized cases only r0 matters for your function it is the input variable and it is used for the return value. You saw in the function I created that the input variable was preserved for later, even when optimized. But, by definition, the caller assumes these registers are destroyed by called functions, and called functions can destroy the contents of these registers and no need to save them.
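The store-then-reload pattern from the first point can usually be reproduced even with optimization enabled by marking the variable volatile. A small sketch; add_fifteen_volatile and home are names made up here, and the exact instructions emitted will depend on the compiler and flags.

```c
#include <stdint.h>

/* The volatile local forces a real store to its stack slot followed by a
   real load, mirroring what the unoptimized build does for every variable. */
uint8_t add_fifteen_volatile(uint8_t input)
{
    volatile uint8_t home = input;   /* store: the variable's "home" in memory */
    return (uint8_t)(home + 15U);    /* load it back, add, truncate to 8 bits */
}
```

Compiling this with -O2 and comparing the disassembly to the -O0 output of the original addFifteen should show the same strb/ldrb pair surviving.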
As far as inline assembly goes, it is a different assembly language than "real" assembly language. I think you have a ways to go before being ready for that, but maybe not. After decades of constant bare metal work I have found zero real use cases for inline assembly; the cases I see are laziness, avoiding allowing real assembly into the make system, or ways to avoid writing real assembly language. I see it as a gee-whiz feature that folks use like unions and bitfields.
Within gnu, for arm, you have at least four incompatible assembly languages for arm. The not unified syntax real assembly, the unified syntax real assembly. The assembly language that you see when you use gcc to assemble instead of as and then inline assembly for gcc. Despite claims of compatibility clang arm assembly language is not 100% compatible with gnu assembly language and llvm/clang does not have a separate assembler you feed it to the compiler. Arms various toolchains over the years have completely incompatible assembly language to gnu for arm. This is all expected and normal. Assembly language is specific to the tool not the target.
Before you can get into inline assembly language, learn some of the real assembly language. And to be fair perhaps you do know it, and perhaps quite well, and this question is about the discovery of how compilers generate code, and how strange it looks as you find out that it is not some one to one thing (all tools in all cases generate the same output from the same input).
For inline asm, while you can specify registers, depending on what you are doing, you generally want to let the compiler choose the register, most of the work for inline assembly is not the assembly but the language that specific compiler uses to interface it...which is compiler specific, move to another compiler and the expectation is a whole new language to learn. While moving between assemblers is also a whole new language at least the syntax of the instructions themselves tend to be the same and the language differences are in everything else, labels and directives and such. And if lucky and it is a toolchain not just an assembler, you can look at the output of the compiler to start to understand the language and compare it to any documentation you can find. Gnus documentation is pretty bad in this case, so a lot of reverse engineering is needed. At the same time you are more likely to be successful with gnu tools over any other, not because they are better, in many cases they are not, but because of the sheer user base and the common features across targets and over decades of history.
I would get really good at interfacing asm with C by creating mock C functions to see which registers are used, etc. And/or even better, implement it in C, compile it, then hand modify/improve/whatever the output of the compiler (you do not need to be a guru to beat the compiler, to be as consistent, perhaps, but fairly often you can easily see improvements that can be made on the output of gcc, and gcc has been getting worse over the last several versions it is not getting better, as you can see from time to time on this site). Get strong in the asm for this toolchain and target and how the compiler works, and then perhaps learn the gnu inline assembly language.
I'm not sure there is a specific purpose to it. It is just one solution that the compiler has found.
For example the code:
unsigned int f(unsigned int a)
{
return sqrt(a + 1);
}
compiles with ARM GCC 9 NONE with optimisation level -O0 to:
push {r7, lr}
sub sp, sp, #8
add r7, sp, #0
str r0, [r7, #4]
ldr r3, [r7, #4]
adds r3, r3, #1
mov r0, r3
bl __aeabi_ui2d
mov r2, r0
mov r3, r1
mov r0, r2
mov r1, r3
bl sqrt
...
and in level -O1 to:
push {r3, lr}
adds r0, r0, #1
bl __aeabi_ui2d
bl sqrt
...
As you can see, the asm is much easier to understand at -O1: the parameter arrives in R0, 1 is added, and the functions are called.
The hardware supports non aligned stack during exception. See here
The "caller saved" registers do not necessarily need to be stored on the stack, it's up to the caller to know whether it needs to store them or not.
Here you are mixing (if I understood correctly) C and assembly, so you have to do the compiler's job before switching back to C: either you keep the values in callee-saved registers (and then you know by convention that the called function will preserve them) or you store them yourself on the stack.

gcc arm optimizes away parameters before System Call

I'm trying to implement some "OSEK-Services" on an arm7tdmi-s using gcc arm. Unfortunately, turning up the optimization level results in "wrong" code generation. The main thing I don't understand is that the compiler seems to ignore the procedure call standard, e.g. passing parameters to a function by moving them into registers r0-r3. I understand that function calls can be inlined, but the parameters still need to be in the registers to perform the system call.
Consider the following code to demonstrate my problem:
unsigned SysCall(unsigned param)
{
volatile unsigned ret_val;
__asm __volatile
(
"swi 0 \n\t" /* perform SystemCall */
"mov %[v], r0 \n\t" /* move the result into ret_val */
: [v]"=r"(ret_val)
:: "r0"
);
return ret_val; /* return the result */
}
int main()
{
unsigned retCode;
retCode = SysCall(5); // expect retCode to be 6 when returning back to usermode
}
I wrote the Top-Level software interrupt handler in assembly as follows:
.type SWIHandler, %function
.global SWIHandler
SWIHandler:
stmfd sp! , {r0-r2, lr} #save regs
ldr r0 , [lr, #-4] #load sysCall instruction and extract sysCall number
bic r0 , #0xff000000
ldr r3 , =DispatchTable #load dispatchTable
ldr r3 , [r3, r0, LSL #2] #load sysCall address into r3
ldmia sp, {r0-r2} #load parameters into r0-r2
mov lr, pc
bx r3
stmia sp ,{r0-r2} #store the result back on the stack
ldr lr, [sp, #12] #restore return address
ldmfd sp! , {r0-r2, lr} #load result into register
movs pc , lr #back to next instruction after swi 0
The dispatch table looks like this:
DispatchTable:
.word activateTaskService
.word getTaskStateService
The SystemCall function looks like this:
unsigned activateTaskService(unsigned tID)
{
return tID + 1; /* only for demonstration */
}
running without optimization everything works fine and the parameters are in the registers as to be expected:
See following code with -O0 optimization:
00000424 <main>:
424: e92d4800 push {fp, lr}
428: e28db004 add fp, sp, #4
42c: e24dd008 sub sp, sp, #8
430: e3a00005 mov r0, #5 #move param into r0
434: ebffffe1 bl 3c0 <SysCall>
000003c0 <SysCall>:
3c0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
3c4: e28db000 add fp, sp, #0
3c8: e24dd014 sub sp, sp, #20
3cc: e50b0010 str r0, [fp, #-16]
3d0: ef000000 svc 0x00000000
3d4: e1a02000 mov r2, r0
3d8: e50b2008 str r2, [fp, #-8]
3dc: e51b3008 ldr r3, [fp, #-8]
3e0: e1a00003 mov r0, r3
3e4: e24bd000 sub sp, fp, #0
3e8: e49db004 pop {fp} ; (ldr fp, [sp], #4)
3ec: e12fff1e bx lr
Compiling the same code with -O3 results in the following assembly code:
00000778 <main>:
778: e24dd008 sub sp, sp, #8
77c: ef000000 svc 0x00000000 #Inline SystemCall without passing params into r0
780: e1a02000 mov r2, r0
784: e3a00000 mov r0, #0
788: e58d2004 str r2, [sp, #4]
78c: e59d3004 ldr r3, [sp, #4]
790: e28dd008 add sp, sp, #8
794: e12fff1e bx lr
Notice how the system call gets inlined without assigning the value 5 to r0.
My first approach is to move those values manually into the registers by adapting the function SysCall from above as follows:
unsigned SysCall(volatile unsigned p1)
{
volatile unsigned ret_val;
__asm __volatile
(
"mov r0, %[p1] \n\t"
"swi 0 \n\t"
"mov %[v], r0 \n\t"
: [v]"=r"(ret_val)
: [p1]"r"(p1)
: "r0"
);
return ret_val;
}
It seems to work in this minimal example, but I'm not sure whether this is the best possible practice. Why does the compiler think it can omit the parameters when inlining the function? Does anybody have suggestions on whether this approach is okay or what should be done differently?
Thank you in advance
A function call in C source code does not instruct the compiler to call the function according to the ABI. It instructs the compiler to call the function according to the model in the C standard, which means the compiler must pass the arguments to the function in a way of its choosing and execute the function in a way that has the same observable effects as defined in the C standard.
Those observable effects do not include setting any processor registers. When a C compiler inlines a function, it is not required to set any particular processor registers. If it calls a function using an ABI for external calls, then it would have to set registers. Inline calls do not need to obey the ABI.
So merely putting your system request inside a function built of C source code does not guarantee that any registers will be set.
For ARM, what you should do is define register variables assigned to the required register(s) and use those as input and output to the assembly instructions:
unsigned SysCall(unsigned param)
{
register unsigned Parameter __asm__("r0") = param;
register unsigned Result __asm__("r0");
__asm__ volatile
(
"swi 0"
: "=r" (Result)
: "r" (Parameter)
: // "memory" // if any inputs are pointers
);
return Result;
}
(This is a major kludge by GCC; it is ugly, and the documentation is poor. But see also https://stackoverflow.com/tags/inline-assembly/info for some links. GCC for some ISAs has convenient specific-register constraints you can use instead of r, but not for ARM.) The register variables do not need to be volatile; the compiler knows they will be used as input and output for the assembly instructions.
The asm statement itself should be volatile if it has side effects other than producing a return value. (e.g. getpid() doesn't need to be volatile.)
A non-volatile asm statement with outputs can be optimized away if the output is unused, or hoisted out of loops if it's used with the same input (like a pure function call). This is almost never what you want for a system call.
You also need a "memory" clobber if any of the inputs are pointers to memory that the kernel will read or modify. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more details (and a way to use a dummy memory input or output to avoid a "memory" clobber.)
A "memory" clobber on mmap/munmap or other system calls that affect what memory means would also be wise; you don't want the compiler to decide to do a store after munmap instead of before.

Why doesn't the stack pointer decrease when I am using a 64 bit local variable?

Here is the disassembly code which compiled from C:
00799d60 <sub_799d60>:
799d60: b573 push {r0, r1, r4, r5, r6, lr}
799d62: 0004 movs r4, r0
799d64: f000 e854 blx 799e10 <jmp_sub_100C54>
799d68: 4b15 ldr r3, [pc, #84] ; (799dc0 <sub_799d60+0x60>)
799d6a: 0005 movs r5, r0
799d6c: 4668 mov r0, sp
799d6e: 4798 blx r3
The target of the subroutine call (799d6e: 4798 blx r3) takes a 64 bit integer pointer argument and returns a 64 bit integer. And that routine is a library function, so I am not able to make any modifications on it.
Could this operation overwrite the stack slots that store the values of lr and r6?
You say that the branch target "takes a 64 bit integer pointer argument and returns a 64 bit integer", but this is not the case. It takes a pointer to a 64-bit integer as its only argument (and this pointer is 32 bits long unless you're on aarch64, which I doubt given the rest of the code); and it returns nothing, it simply overwrites the 64-bit value pointed to by the argument you passed in. I'm sure this is what you meant, but be careful with your use of terminology, because the difference between these things is important! In particular there is no 64-bit argument passed either into or out of the function you're calling.
On to the question itself. The key to understanding what the compiler is doing here is to look at the very first line:
push {r0, r1, r4, r5, r6, lr}
The ARM calling convention doesn't require r0 and r1 to be call-preserved, so what are they doing in the list? The answer is that the compiler has added these 'dummy' pushes to create some space on the stack. The push operation above is essentially equivalent to
push {r4, r5, r6, lr}
sub sp, sp, #0x08
except that it saves an instruction. The result is not quite the same, of course, because whatever was in r0 and r1 ends up being written to these locations; but given that there's no way to know what was there beforehand, and the stacked values are about to get overwritten anyway, it's of no consequence. So we have, as a stack frame,
lr
r6
r5
r4
(r1)
sp -> (r0)
with the stack pointer pointing at the space created by the dummy push of r0 and r1. Now we just have
mov r0, sp
which copies the stack pointer to r0 to use as the pointer argument to the function you're calling, which will then overwrite the two words at this location to result in a stack frame of
lr
r6
r5
r4
(64-bit value, high word)
sp -> (64-bit value, low word)
You haven't shown any code beyond the blx r3, so it's not possible to say exactly what happens to the stack at the end of the function. But if this function returns no value, I would expect to see a matching
pop {r0, r1, r4, r5, r6, pc}
which will, of course, result in your 64-bit result being left in r0 and r1. But these registers are call-clobbered according to the calling convention so there's no problem.
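For reference, C source of roughly the following shape would be expected to produce the "mov r0, sp" followed by a call seen above: an 8-byte local whose address is passed to the callee. This is only a sketch; get_value64 is a hypothetical stand-in for the library routine reached through blx r3, and low_word_of_result is an invented caller.

```c
#include <stdint.h>

/* Hypothetical stand-in for the library routine: it fills in the 64-bit
   object it is pointed at and returns nothing. */
static void get_value64(uint64_t *out)
{
    *out = 0x1122334455667788ULL;
}

/* The caller needs 8 bytes of stack for tmp; on 32-bit ARM the compiler may
   obtain them with two dummy pushes instead of a separate sub sp, sp, #8. */
uint32_t low_word_of_result(void)
{
    uint64_t tmp;

    get_value64(&tmp);       /* pointer argument, 32 bits wide on ARM32 */
    return (uint32_t)tmp;    /* low word of the 64-bit result */
}
```

The space for tmp belongs to the caller's frame, so the callee's writes land in the dummy slots and cannot reach the saved r4-r6 or lr above them.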

Optimize C or assembly code in size for Cortex-M0

I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the ROM data has to be copied to the RAM data once. Therefore I have this piece of code:
void __startup( void ){
extern unsigned int __data_init_start;
extern unsigned int __data_start;
extern unsigned int __data_end;
// copy .data section from flash to ram
s = & __data_init_start;
d = & __data_start;
e = & __data_end;
while( d != e ){
*d++ = *s++;
}
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?
The compiler (or you!) does not realize that the range to copy is end - start. There seems to be some unnecessarily shuffling of data going on -- the 2 add and the sub in the loop. Also, it seems to me the compiler makes sure that the number of copies to make is a multiple of 4. An obvious optimization, then, is to make sure it is in advance! Below I assume it is (if not, the bne will fail and happily keep on copying and trample all over your memory).
Using my decade-old ARM assembler knowledge (yes, that is a major disclaimer), and post-incrementing, I think the following short snippet is what it can be condensed to. From 18 instructions down to 8, not too bad. If it works.
ldr r1, __data_init_start
ldr r2, __data_start
ldr r3, __data_end
sub r4, r3, r2
.L1:
ldr r3, [r1], #4 ; safe to re-use r3 here
str r3, [r2], #4
subs r4, r4, #4
bne .L1
Maybe that platform guarantees that writing through an unsigned int * may change the value of an unsigned int * object (i.e. the compiler doesn't take advantage of type-based aliasing rules).
Then the code is inefficient because e is a global variable, and the generated code must take into account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local that never had its address taken is not possible from a C point of view).
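The suggestion to make the pointers local can be sketched like this. copy_words is a name invented here; in the actual startup code it would presumably be called with the addresses of the linker symbols from the question.

```c
/* Copy words from src into [dst, dst_end). All three pointers are
   parameters, i.e. locals whose address is never taken, so a store
   through dst cannot modify them and they can stay in registers. */
void copy_words(unsigned int *dst, const unsigned int *src,
                const unsigned int *dst_end)
{
    while (dst != dst_end)
        *dst++ = *src++;
}

/* Assumed usage in the startup routine from the question:
   copy_words(&__data_start, &__data_init_start, &__data_end); */
```

With nothing global inside the loop, the compiler no longer has a reason to reload the end pointer on every iteration.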

Local variable location in memory

For a homework assignment I have been given some c files, and compiled them using arm-linux-gcc (we will eventually be targeting gumstix boards, but for these exercises we have been working with qemu and ema).
One of the questions confuses me a bit-- we are told to:
Use arm-linux-objdump to find the location of variables declared in main() in the executable binary.
However, these variables are local and thus shouldn't have addresses until runtime, correct?
I'm thinking that maybe what I need to find is the offset in the stack frame, which can in fact be found using objdump (not that I know how).
Anyways, any insight into the matter would be greatly appreciated, and I would be happy to post the source code if necessary.
unsigned int one ( unsigned int, unsigned int );
unsigned int two ( unsigned int, unsigned int );
unsigned int myfun ( unsigned int x, unsigned int y, unsigned int z )
{
unsigned int a,b;
a=one(x,y);
b=two(a,z);
return(a+b);
}
compile and disassemble
arm-none-eabi-gcc -c fun.c -o fun.o
arm-none-eabi-objdump -D fun.o
code created by compiler
00000000 <myfun>:
0: e92d4800 push {fp, lr}
4: e28db004 add fp, sp, #4
8: e24dd018 sub sp, sp, #24
c: e50b0010 str r0, [fp, #-16]
10: e50b1014 str r1, [fp, #-20]
14: e50b2018 str r2, [fp, #-24]
18: e51b0010 ldr r0, [fp, #-16]
1c: e51b1014 ldr r1, [fp, #-20]
20: ebfffffe bl 0 <one>
24: e50b0008 str r0, [fp, #-8]
28: e51b0008 ldr r0, [fp, #-8]
2c: e51b1018 ldr r1, [fp, #-24]
30: ebfffffe bl 0 <two>
34: e50b000c str r0, [fp, #-12]
38: e51b2008 ldr r2, [fp, #-8]
3c: e51b300c ldr r3, [fp, #-12]
40: e0823003 add r3, r2, r3
44: e1a00003 mov r0, r3
48: e24bd004 sub sp, fp, #4
4c: e8bd4800 pop {fp, lr}
50: e12fff1e bx lr
Short answer is the memory is "allocated" both at compile time and at run time. At compile time in the sense that the compiler at compile time determines the size of the stack frame and who goes where. Run time in the sense that the memory itself is on the stack which is a dynamic thing. The stack frame is taken from stack memory at run time, almost like a malloc() and free().
It helps to know the calling convention: x enters in r0, y in r1, z in r2. Then x has its home at fp-16, y at fp-20, and z at fp-24. The call to one() needs x and y, so it loads those back from the stack. The result of one() goes into a, which is saved at fp-8, so that is the home for a. And so on.
The function one() is not really at address 0; this is a disassembly of an object file, not a linked binary. Once an object is linked in with the rest of the objects and libraries, the missing parts, like where external functions are, are patched in by the linker, and the calls to one() and two() will get real addresses (and the program will likely not start at address 0).
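One way to picture the homes in that listing is as a struct laid over the frame. The sketch below is only a model: struct myfun_frame, fp_offset() and check_frame_model() are hypothetical names, and the layout is read off the unoptimized listing above, not something the compiler promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the -O0 frame of myfun(), lowest address (sp) first. */
struct myfun_frame {
    uint32_t unused;    /* fp-28: reserved by sub sp, sp, #24 but never written */
    uint32_t z;         /* fp-24 */
    uint32_t y;         /* fp-20 */
    uint32_t x;         /* fp-16 */
    uint32_t b;         /* fp-12 */
    uint32_t a;         /* fp-8  */
    uint32_t saved_fp;  /* fp-4: stored by push {fp, lr} */
    uint32_t saved_lr;  /* fp+0: add fp, sp, #4 leaves fp pointing here */
};

/* Offset of a member relative to fp, given its offset within the struct. */
static long fp_offset(size_t member_offset)
{
    return (long)member_offset - (long)offsetof(struct myfun_frame, saved_lr);
}

/* Checks the model against the offsets that appear in the listing. */
void check_frame_model(void)
{
    assert(sizeof(struct myfun_frame) == 32);   /* 8 pushed + 24 reserved */
    assert(fp_offset(offsetof(struct myfun_frame, a)) == -8);
    assert(fp_offset(offsetof(struct myfun_frame, b)) == -12);
    assert(fp_offset(offsetof(struct myfun_frame, x)) == -16);
    assert(fp_offset(offsetof(struct myfun_frame, y)) == -20);
    assert(fp_offset(offsetof(struct myfun_frame, z)) == -24);
}
```

The offsets are the compile-time part of the answer; the address that fp holds when myfun() runs is the run-time part.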
I cheated here a little: I knew that with no optimizations enabled the compiler would build a stack frame, even though for a relatively simple function like this there really is no reason for one:
compile with just a little optimization
arm-none-eabi-gcc -O1 -c fun.c -o fun.o
arm-none-eabi-objdump -D fun.o
and the stack frame is gone, the local variables remain in registers.
00000000 <myfun>:
0: e92d4038 push {r3, r4, r5, lr}
4: e1a05002 mov r5, r2
8: ebfffffe bl 0 <one>
c: e1a04000 mov r4, r0
10: e1a01005 mov r1, r5
14: ebfffffe bl 0 <two>
18: e0800004 add r0, r0, r4
1c: e8bd4038 pop {r3, r4, r5, lr}
20: e12fff1e bx lr
What the compiler decided to do instead is give itself more registers to work with by saving them on the stack. Why it saved r3 is not obvious; more on that below.
Entering the function, r0 = x, r1 = y and r2 = z per the calling convention. We can leave r0 and r1 alone (try again with one(y,x) and see what happens) since they drop right into one() and are never used again. The calling convention says that r0-r3 can be destroyed by a function, so we need to preserve z for later, and we save it in r5. The result of one() is in r0 per the calling convention. Since two() can destroy r0-r3, we need to save a for after the call to two() (and we need r0 for the call to two() anyway), so r4 now holds a. We need the result of one() as the first parameter to two(), and it is already in r0; we need z as the second, so we move r5, where we had saved z, to r1, then we call two(). The result of two() comes back in r0 per the calling convention. Since b + a = a + b, the final add before returning is r0 + r4, which is b + a, and the result goes in r0, which is the register used to return a value from a function, per the convention. Clean up the stack and restore the modified registers, done.
Since myfun() makes calls to other functions using bl, and bl modifies the link register (r14), the value in the link register has to be preserved from the entry into the function to the final return (bx lr) in order to be able to return from myfun(), so lr is pushed on the stack. The convention states that we can destroy r0-r3 in our function but not other registers, so r4 and r5 are pushed on the stack because we used them. Pushing r3 is not necessary from a register-preservation perspective. I wondered if it was done in anticipation of a 64 bit memory system, where two full 64 bit writes are cheaper than one 64 bit write and one 32 bit write. A more likely explanation is stack alignment: the ARM EABI requires the stack pointer to be 8-byte aligned at calls to public functions, and pushing four registers (16 bytes) rather than three (12 bytes) keeps it that way. Either way, there is no need to preserve the value of r3 in this code.
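The register shuffle described above can be written out in C, one statement per instruction. This is only a sketch: the bodies of one() and two() here are arbitrary stand-ins (the real ones are external and unknown), and myfun_regs() and same_result() are made-up names. The variables r0, r1, r2, r4 and r5 play the role of the registers.

```c
/* Stand-ins: the real one() and two() are external; these bodies are arbitrary. */
static unsigned int one(unsigned int p, unsigned int q) { return p * 3u + q; }
static unsigned int two(unsigned int p, unsigned int q) { return p ^ (q + 7u); }

/* The original source. */
static unsigned int myfun(unsigned int x, unsigned int y, unsigned int z)
{
    unsigned int a, b;
    a = one(x, y);
    b = two(a, z);
    return a + b;
}

/* The -O1 listing, one C statement per instruction. */
static unsigned int myfun_regs(unsigned int r0, unsigned int r1, unsigned int r2)
{
    unsigned int r4, r5;   /* push {r3, r4, r5, lr}: r4 and r5 are callee-saved */
    r5 = r2;               /* mov r5, r2      keep z across the call to one() */
    r0 = one(r0, r1);      /* bl one          a comes back in r0 */
    r4 = r0;               /* mov r4, r0      keep a across the call to two() */
    r1 = r5;               /* mov r1, r5      z becomes the second argument */
    r0 = two(r0, r1);      /* bl two          b comes back in r0 */
    r0 = r0 + r4;          /* add r0, r0, r4  b + a */
    return r0;             /* pop {r3, r4, r5, lr}; bx lr */
}

/* Returns 1 if the register-level version agrees with the source version. */
int same_result(unsigned int x, unsigned int y, unsigned int z)
{
    return myfun(x, y, z) == myfun_regs(x, y, z);
}
```

Note that a and b never get a memory home in this version, which is why there is nothing for objdump to show as a stack offset once optimization is on.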
Now take this knowledge and disassemble the code assigned (arm-...-objdump -D something.something) and do the same kind of analysis. Particularly with functions named main() vs functions not named main (I did not use main() on purpose), the stack frame can be a size that doesn't make sense, or makes less sense than in other functions. In the non-optimized case above, five values (x, y, z, a and b) live in the frame at fp-8 through fp-24, which is 5*4 = 20 bytes; sub sp, sp, #24 reserves those plus one unused word, and the push of fp and lr accounts for another 8 bytes. There is a command line option that tells the compiler not to use a frame pointer, -fomit-frame-pointer, and it saves a couple of instructions:
00000000 <myfun>:
0: e52de004 push {lr} ; (str lr, [sp, #-4]!)
4: e24dd01c sub sp, sp, #28
8: e58d000c str r0, [sp, #12]
c: e58d1008 str r1, [sp, #8]
10: e58d2004 str r2, [sp, #4]
14: e59d000c ldr r0, [sp, #12]
18: e59d1008 ldr r1, [sp, #8]
1c: ebfffffe bl 0 <one>
20: e58d0014 str r0, [sp, #20]
24: e59d0014 ldr r0, [sp, #20]
28: e59d1004 ldr r1, [sp, #4]
2c: ebfffffe bl 0 <two>
30: e58d0010 str r0, [sp, #16]
34: e59d2014 ldr r2, [sp, #20]
38: e59d3010 ldr r3, [sp, #16]
3c: e0823003 add r3, r2, r3
40: e1a00003 mov r0, r3
44: e28dd01c add sp, sp, #28
48: e49de004 pop {lr} ; (ldr lr, [sp], #4)
4c: e12fff1e bx lr
optimizing saves a whole lot more though...
It's going to depend on the program and how exactly they want the location of the variables. Does the question want the section they're stored in (.const, .bss, etc.)? Does it want specific addresses? Either way, a good start is the objdump -S flag:
objdump -S myprogram > dump.txt
This is nice because it will print your source code intermixed with the assembly and addresses (the program needs to have been compiled with debug info, -g, for the source lines to appear). From here just search for your int main and that should get you started.
