In my source code I noticed odd behavior from the ARM compiler: it iterates over a string twice, which is unnecessary. Below is a minimal example that shows it, followed by my question.
#include <string.h>
#define MIN(x, y) (((x) < (y)) ? (x) : (y))
int MAX_FILE_NAME = 2500;
int F(char *file){
int file_len = MIN(strlen(file), MAX_FILE_NAME - 1);
return file_len;
}
int main(void) {
F(__FILE__);
return 0 ;
}
compiled with:
arm-none-eabi-gcc -nostdlib -Xlinker -Map="m7_experiments.map" -Xlinker --cref -Xlinker --gc-sections -Xlinker -print-memory-usage -mcpu=cortex-m7 -mfpu=fpv5-sp-d16 -mfloat-abi=hard -mthumb -T "m7_experiments_Debug.ld" -o "m7_experiments.axf" ./src/cr_startup_cm7.o ./src/crp.o ./src/flashconfig.o ./src/m7_experiments.o
Leads to:
Dump of assembler code for function F:
0x00000104 <+0>: push {r4, lr}
0x00000106 <+2>: mov r4, r0
0x00000108 <+4>: bl 0x13c <strlen>
0x0000010c <+8>: mov r2, r0
0x0000010e <+10>: ldr r3, [pc, #20] ; (0x124 <F+32>)
0x00000110 <+12>: ldr r0, [r3, #0]
0x00000112 <+14>: subs r0, #1
0x00000114 <+16>: cmp r2, r0
0x00000116 <+18>: bcc.n 0x11a <F+22>
0x00000118 <+20>: pop {r4, pc}
0x0000011a <+22>: mov r0, r4
0x0000011c <+24>: bl 0x13c <strlen>
0x00000120 <+28>: b.n 0x118 <F+20>
0x00000122 <+30>: nop
0x00000124 <+32>: lsls r0, r3, #6
0x00000126 <+34>: movs r0, r0
Note that when the string is shorter than the defined limit, its length is not simply taken from r2 but computed again, so the run time can be as long as two passes over the string. This seems unnecessary. Is there some way to justify the compiler behavior in this case? I'm interested to know.
It is redundant. But that is because of your code, not the compiler. That macro is going to expand to this:
// x = strlen(file)
// y = MAX_FILE_NAME - 1
(((strlen(file)) < (MAX_FILE_NAME - 1)) ? (strlen(file)) : (MAX_FILE_NAME - 1))
Remember, the preprocessor is essentially just a glorified copy and paste machine. You're calling strlen twice. Try this:
size_t file_len = strlen(file);
file_len = MIN(file_len, MAX_FILE_NAME - 1);
Is there some way to justify the compiler behavior in this case? I'm interested to know.
The compiler is playing it safe.
With higher levels of optimization, the compiler uses inside knowledge of strlen() and "knows" strlen(file) will return the same value with a 2nd call.
Consider:
int file_len = MIN(rand(), MAX_FILE_NAME - 1);
MIN() might not return the minimum, even with optimizations enabled, because it must call rand() a second time when the first result was the smaller value, and that second call can return something different.
Consider:
int file_len = MIN(some_user_function(file), MAX_FILE_NAME - 1);
The compiler likely knows little about some_user_function(file), so it calls some_user_function(file) a second time when needed.
If a variable is not declared with the volatile keyword, the compiler will likely cache it. Otherwise, the variable must always be accessed from memory until its translation unit ends. The part I wonder about is in the assembly.
int main() {
/* volatile */ int lock = 999;
while (lock);
}
With x86-64 clang 3.0.0, the generated assembly is as follows.
main: # @main
mov DWORD PTR [RSP - 4], 0
mov DWORD PTR [RSP - 8], 999
.LBB0_1: # =>This Inner Loop Header: Depth=1
cmp DWORD PTR [RSP - 8], 0
je .LBB0_3
jmp .LBB0_1
.LBB0_3:
mov EAX, DWORD PTR [RSP - 4]
ret
With the volatile keyword uncommented, the output becomes the following.
main: # @main
mov DWORD PTR [RSP - 4], 0
mov DWORD PTR [RSP - 8], 999
.LBB0_1: # =>This Inner Loop Header: Depth=1
mov EAX, DWORD PTR [RSP - 8]
cmp EAX, 0
je .LBB0_3
jmp .LBB0_1
.LBB0_3:
mov EAX, DWORD PTR [RSP - 4]
ret
The points I don't understand:
cmp DWORD PTR [RSP - 8], 0 . <---
Why is the comparison done with 0 whilst DWORD PTR [RSP - 8] holds 999 within ?
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
It looks like you forgot to enable optimization. -O0 treats all variables (except register variables) pretty similarly to volatile for consistent debugging.
With optimization enabled, compilers can hoist non-volatile loads out of loops. while(locked); will compile similarly to source like
if (locked) {
while(1){}
}
Or since locked has a compile-time-constant initializer, the whole function should compile to jmp main (an infinite loop).
See MCU programming - C++ O2 optimization breaks while loop for more details.
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
Some compilers are worse at folding loads into memory operands for other instructions when you use volatile. I think that's why you're getting a separate mov load here; it's just a missed optimization.
(Although cmp [mem], imm might be less efficient. I forget if it can macro-fuse with a JCC or something. With a RIP-relative addressing mode it couldn't micro-fuse the load, but a register base is ok.)
cmp EAX, 0 is weird, I guess clang with optimization disabled doesn't look for test eax,eax as a peephole optimization for comparing against zero.
As @user3386109 commented, locked in a boolean context is equivalent to locked != 0 in C / C++.
The compiler doesn't know about caching, and this is not a caching thing: volatile tells the compiler that the value may change between accesses. So, to implement our code faithfully, it has to perform the accesses we ask for, in the order we ask for them. It can't optimize them out.
void fun1 ( void )
{
/* volatile */ int lock = 999;
while (lock) continue;
}
void fun2 ( void )
{
volatile int lock = 999;
while (lock) continue;
}
volatile int vlock;
int ulock;
void fun3 ( void )
{
while(vlock) continue;
}
void fun4 ( void )
{
while(ulock) continue;
}
void fun5 ( void )
{
vlock=3;
vlock=4;
}
void fun6 ( void )
{
ulock=3;
ulock=4;
}
I find this easier to see in ARM assembly, but the architecture doesn't really matter.
Disassembly of section .text:
00001000 <fun1>:
1000: eafffffe b 1000 <fun1>
00001004 <fun2>:
1004: e59f3018 ldr r3, [pc, #24] ; 1024 <fun2+0x20>
1008: e24dd008 sub sp, sp, #8
100c: e58d3004 str r3, [sp, #4]
1010: e59d3004 ldr r3, [sp, #4]
1014: e3530000 cmp r3, #0
1018: 1afffffc bne 1010 <fun2+0xc>
101c: e28dd008 add sp, sp, #8
1020: e12fff1e bx lr
1024: 000003e7 andeq r0, r0, r7, ror #7
00001028 <fun3>:
1028: e59f200c ldr r2, [pc, #12] ; 103c <fun3+0x14>
102c: e5923000 ldr r3, [r2]
1030: e3530000 cmp r3, #0
1034: 012fff1e bxeq lr
1038: eafffffb b 102c <fun3+0x4>
103c: 00002000
00001040 <fun4>:
1040: e59f3014 ldr r3, [pc, #20] ; 105c <fun4+0x1c>
1044: e5933000 ldr r3, [r3]
1048: e3530000 cmp r3, #0
104c: 012fff1e bxeq lr
1050: e3530000 cmp r3, #0
1054: 012fff1e bxeq lr
1058: eafffffa b 1048 <fun4+0x8>
105c: 00002004
00001060 <fun5>:
1060: e3a01003 mov r1, #3
1064: e3a02004 mov r2, #4
1068: e59f3008 ldr r3, [pc, #8] ; 1078 <fun5+0x18>
106c: e5831000 str r1, [r3]
1070: e5832000 str r2, [r3]
1074: e12fff1e bx lr
1078: 00002000
0000107c <fun6>:
107c: e3a02004 mov r2, #4
1080: e59f3004 ldr r3, [pc, #4] ; 108c <fun6+0x10>
1084: e5832000 str r2, [r3]
1088: e12fff1e bx lr
108c: 00002004
Disassembly of section .bss:
00002000 <vlock>:
2000: 00000000
00002004 <ulock>:
2004: 00000000
First one is the most telling:
00001000 <fun1>:
1000: eafffffe b 1000 <fun1>
Because lock is an initialized, non-volatile local variable, the compiler can assume it won't change value between accesses, so it can never change in the while loop and this is essentially a while(1) loop. If the initial value had been zero, this would be a simple return, since a non-volatile variable could never become non-zero.
fun2 uses a volatile local variable, so a stack frame has to be built.
It does what one assumes the code was trying to do: wait on a shared variable, one that can change during the loop.
1010: e59d3004 ldr r3, [sp, #4]
1014: e3530000 cmp r3, #0
1018: 1afffffc bne 1010 <fun2+0xc>
so it samples it and tests what it samples each time through the loop.
fun3 and fun4 are the same deal but more realistic: code outside the function is never going to change a local lock, so a non-global variable doesn't make much sense for your while loop.
102c: e5923000 ldr r3, [r2]
1030: e3530000 cmp r3, #0
1034: 012fff1e bxeq lr
1038: eafffffb b 102c <fun3+0x4>
For the volatile fun3 case the variable has to be read and tested each loop
1044: e5933000 ldr r3, [r3]
1048: e3530000 cmp r3, #0
104c: 012fff1e bxeq lr
1050: e3530000 cmp r3, #0
1054: 012fff1e bxeq lr
1058: eafffffa b 1048 <fun4+0x8>
For the non-volatile global, it only has to sample the variable once. What the compiler did here is very interesting, and I have to think about why it would do that, but either way you can see that the "loop" retests the value held in a register (not cached), which will never change in a proper program. Functionally, by leaving out volatile we asked it to read the variable only once, and it then tests that value indefinitely.
fun5 and fun6 further demonstrate that volatile requires the compiler to perform each access to the variable in its storage place before moving on to the next operation or access in the code. With volatile, we are asking the compiler to perform two assignments, so two stores. Without volatile, the compiler can optimize out the first store and perform only the last one: looking at the code as a whole, fun6 leaves the variable set to 4, so that is all the compiled function has to do.
The x86 solution is equally interesting: repz retq is all over it (with the compiler on my computer), and it is not hard to find out what that is all about.
None of the aarch64, x86, mips, riscv, msp430, or pdp11 backends do the double check seen in fun4().
pdp11 is actually the easier code to read (no surprise there)
00000000 <_fun1>:
0: 01ff br 0 <_fun1>
00000002 <_fun2>:
2: 65c6 fffe add $-2, sp
6: 15ce 03e7 mov $1747, (sp)
a: 1380 mov (sp), r0
c: 02fe bne a <_fun2+0x8>
e: 65c6 0002 add $2, sp
12: 0087 rts pc
00000014 <_fun3>:
14: 1dc0 0026 mov $3e <_vlock>, r0
18: 02fd bne 14 <_fun3>
1a: 0087 rts pc
0000001c <_fun4>:
1c: 1dc0 001c mov $3c <_ulock>, r0
20: 0bc0 tst r0
22: 02fe bne 20 <_fun4+0x4>
24: 0087 rts pc
00000026 <_fun5>:
26: 15f7 0003 0012 mov $3, $3e <_vlock>
2c: 15f7 0004 000c mov $4, $3e <_vlock>
32: 0087 rts pc
00000034 <_fun6>:
34: 15f7 0004 0002 mov $4, $3c <_ulock>
3a: 0087 rts pc
(this is the not linked version)
cmp DWORD PTR [RSP - 8], 0 . <--- Why is the comparison done with 0 whilst DWORD PTR [RSP - 8] holds 999 within ?
while does a true/false test, meaning: is the value equal to zero or not equal to zero?
Why is DWORD PTR [RSP - 8] copied into EAX and again why is the comparison done between 0 and EAX?
mov -0x8(%rsp),%eax
cmp 0,%eax
cmp 0,-0x8(%rsp)
as so.s -o so.o
so.s: Assembler messages:
so.s:3: Error: too many memory references for `cmp'
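As written, the assembler rejects the one-instruction form. Note, though, that in AT&T syntax an immediate needs a $ prefix, so the bare 0 here is read as a memory reference, which is why the error complains about too many memory references. x86 can compare memory against an immediate in one instruction (the non-volatile listing in the question does exactly that with cmp DWORD PTR [RSP - 8], 0), so the separate load into EAX in the volatile case is a code-generation choice, not something the instruction set forces.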
I was checking some gcc generated assembly for ARM and noticed that I get strange results if I use designated initializers:
E.g. if I have this code:
struct test
{
int x;
int y;
};
__attribute__((noinline))
struct test get_struct_1(void)
{
struct test x;
x.x = 123456780;
x.y = 123456781;
return x;
}
__attribute__((noinline))
struct test get_struct_2(void)
{
return (struct test){ .x = 123456780, .y = 123456781 };
}
I get the following output with gcc -O2 -std=c11 for ARM (ARM GCC 6.3.0):
get_struct_1:
ldr r1, .L2
ldr r2, .L2+4
stm r0, {r1, r2}
bx lr
.L2:
.word 123456780
.word 123456781
get_struct_2: // <--- what is happening here
mov r3, r0
ldr r2, .L5
ldm r2, {r0, r1}
stm r3, {r0, r1}
mov r0, r3
bx lr
.L5:
.word .LANCHOR0
I can see the constants for the first function, but I don't understand how get_struct_2 works.
If I compile for x86, both functions just load the same single 64-bit value in a single instruction.
get_struct_1:
movabs rax, 530242836987890956
ret
get_struct_2:
movabs rax, 530242836987890956
ret
Am I provoking some undefined behavior, or is this .LANCHOR0 somehow related to these constants?
Looks like gcc shoots itself in the foot with an extra level of indirection after merging the loads of the constants into an ldm.
No idea why, but pretty obviously a missed optimization bug.
x86-64 is easy to optimize for; the entire 8-byte constant can go in one immediate. But ARM often uses PC-relative loads for constants that are too big for one immediate.
I was looking at ARM assembly code generated by GCC, and I noticed that it compiled a function to the following code:
0x00010504 <+0>: push {r7, lr}
0x00010506 <+2>: sub sp, #24
0x00010508 <+4>: add r7, sp, #0
0x0001050a <+6>: str r0, [r7, #4]
=> 0x0001050c <+8>: mov r3, lr
0x0001050e <+10>: mov r1, r3
0x00010510 <+12>: movw r0, #1664 ; 0x680
0x00010514 <+16>: movt r0, #1
0x00010518 <+20>: blx 0x10378 <printf@plt>
0x0001051c <+24>: add.w r3, r7, #12
0x00010520 <+28>: mov r0, r3
0x00010522 <+30>: blx 0x10384 <gets@plt>
0x00010526 <+34>: mov r3, lr
0x00010528 <+36>: mov r1, r3
0x0001052a <+38>: movw r0, #1728 ; 0x6c0
0x0001052e <+42>: movt r0, #1
0x00010532 <+46>: blx 0x10378 <printf@plt>
0x00010536 <+50>: adds r7, #24
0x00010538 <+52>: mov sp, r7
0x0001053a <+54>: pop {r7, pc}
What I found interesting is that GCC uses R7 when popping into PC instead of using LR. I saw a similar thing with R11: the compiler pushes R11 and LR onto the stack and then pops into R11 and PC. Shouldn't LR act as the return address instead of R7 or R11? Why is R7 (which is the frame pointer in Thumb mode) being used here?
If you look at the Apple iOS calling convention, it is different again: it pops other registers (e.g. r4 to r7) along with PC to return control. Shouldn't it use LR?
Or am I missing something here?
Another question: it looks like the LR, R11, or R7 values are never the return address itself, but a pointer to the stack location that contains the return address. Is that right?
Another odd thing is that the compiler does not always generate the same function epilogue. For example, instead of popping into PC it might use bx LR. Why?
Well, first off, they likely want to keep the stack aligned on a 64-bit boundary.
R7 is a better choice of frame pointer than any higher register, because r8 to r15 are not supported by most Thumb instructions. I would have to look it up, but I assume there are special pc- and sp-relative load/store instructions, so why burn r7 at all?
I'm not sure about everything you are asking. In Thumb you can push lr but pop pc, and I think that is equivalent to bx lr, but you have to look it up for each architecture, because on some you cannot switch modes with pop. In this case the compiler appears to assume it can, and so does not burn extra instructions on a pop r3, bx r3 kind of sequence. Doing that would actually have needed two extra instructions: pop r7, pop r3, bx r3.
So it may be that one compiler is told which architecture is being used and can assume pop pc is safe, while another is not so sure. Again, you have to read the ARM architecture documents for the various architectures to know which instructions can and cannot be used to change modes. If you walk through various architecture types with GNU, the way it returns may change.
EDIT
unsigned int morefun ( unsigned int, unsigned int );
unsigned int fun ( unsigned int x, unsigned int y )
{
x+=1;
return(morefun(x,y+2)+3);
}
arm-none-eabi-gcc -O2 -mthumb -c so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: b510 push {r4, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bc10 pop {r4}
e: bc02 pop {r1}
10: 4708 bx r1
12: 46c0 nop ; (mov r8, r8)
arm-none-eabi-gcc -O2 -mthumb -mcpu=cortex-m3 -march=armv7-m -c so.c -o so.o
arm-none-eabi-objdump -D so.o
00000000 <fun>:
0: b508 push {r3, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bd08 pop {r3, pc}
e: bf00 nop
Using just that -march without the -mcpu gives the same result (it doesn't pop lr into r1 and bx).
-march=armv5t changes it up slightly:
00000000 <fun>:
0: b510 push {r4, lr}
2: 3102 adds r1, #2
4: 3001 adds r0, #1
6: f7ff fffe bl 0 <morefun>
a: 3003 adds r0, #3
c: bd10 pop {r4, pc}
e: 46c0 nop ; (mov r8, r8)
armv4t, as expected, does the pop and bx thing.
armv6-m gives what armv5t gave.
gcc version 6.1.0 built using --target=arm-none-eabi without any other arm specifier.
So, if I understand the question correctly, the OP is probably seeing the three-instruction pop, pop, bx sequence rather than a single pop {rx, pc}, or at least one compiler's output differs from another's. Apple iOS was mentioned, so that toolchain likely defaults to a heavier-duty core rather than a works-everywhere target. And their gcc, like mine, defaults to code that works everywhere (including the original ARMv4T) rather than everywhere but the original. I assume that if you add some command-line options you will see gcc behave differently, as I have demonstrated.
Note that r3 and r4 are not used in these examples, so why are they preserved? It is likely the first thing I mentioned: keeping 64-bit alignment on the stack. With the works-on-all-Thumb-variants solution, if an interrupt arrives between the two pops, the interrupt handler has to deal with an unaligned stack. Since r4 was a throwaway anyway, they could have popped r1 and r2, or r2 and r3, and then done bx r2 or bx r3 respectively; that would have avoided the unaligned moment and saved an instruction. Oh well...
Here is a C source code example:
register int a asm("r8");
register int b asm("r9");
int main() {
int c;
a=2;
b=3;
c=a+b;
return c;
}
And this is the assembly generated by an ARM gcc cross compiler:
$ arm-linux-gnueabi-gcc -c global_reg_var_test.c -Wa,-a,-ad
...
mov r8, #2
mov r9, #3
mov r2, r8
mov r3, r9
add r3, r2, r3
...
With -frename-registers, the behaviour was the same. (Updated: previously I had said with -O3.)
So the question is: why does gcc add the 3rd and 4th MOVs instead of emitting 'ADD R3, R8, R9'?
Context: I need to optimize code for a simulated in-order CPU (gem5 ARM MinorCPU) that doesn't rename registers.
I took your real example (posted in comments) and put it on the Godbolt compiler explorer. The main inefficiency in calc() is that src1 and src2 are globals that it has to load from memory, instead of args passed in registers.
I didn't look at main, just calc.
register int sum asm ("r4");
register int r asm ("r5");
register int c asm ("r6");
register int k asm ("r7");
register int temp1 asm ("r8"); // really? you're using two global register vars for scratch temporaries? Just let the compiler do its job.
register int temp2 asm ("r9");
register long n asm ("r10");
int *src1, *src2, *dst;
void calc() {
temp1 = r*n;
temp2 = k*n;
temp1 = temp1+k;
temp2 = temp2+c;
// you get bad code for this because src1 and src2 are globals, not args passed in regs
sum = sum + src1[temp1] * src2[temp2];
}
# gcc 4.8.2 -O3 -Wall -Wextra -Wa,-a,-ad -fverbose-asm
mla r0, r10, r7, r6 # temp2.9, n, k, c ## tmp = k*n + c
movw r3, #:lower16:.LANCHOR0 # tmp136,
mla r8, r10, r5, r7 # temp1, n, r, k ## temp1 = r*n + k
movt r3, #:upper16:.LANCHOR0 # tmp136,
ldmia r3, {r1, r2} # tmp136,, ## load both pointers, since they're stored adjacently in memory
mov r9, r0 # temp2, temp2.9 ## This insn is wasted: the first MLA should have had this as the dest
ldr r3, [r1, r8, lsl #2] # *_22, *_22
ldr r2, [r2, r9, lsl #2] # *_28, *_28
mla r4, r2, r3, r4 # sum, *_28, *_22, sum
bx lr #
For some reason, one of the integer multiply-accumulate (mla) instructions uses r8 (temp1) as the destination, but the other one writes to r0 (a scratch reg), and only later moves the result to r9 (temp2).
The sum += src1[temp1] * src2[temp2] is done with an mla that reads and writes r4 (sum).
Why do you need temp1 and temp2 to be globals? That's just going to stop the optimizer from doing aggressive optimizations that don't calculate exactly the same temporaries that the C source does. Fortunately the C memory model is weak enough that it should be able to reorder assignments to them, although this might actually be why it didn't MLA into temp2 directly, since it decided to do that calculation first. (Hmm, does the memory model even apply? Other threads can't see our registers at all, so those globals are all effectively thread-local. It should allow relaxed ordering for assignments to globals. Signal handlers can see these globals, and could run at any point. gcc isn't following strict source order, since in the source both multiplies happen before either add.)
Godbolt doesn't have a newer ARM gcc version, so I can't easily test a newer gcc. A newer gcc might do a better job with this.
BTW, I tried a version of the function using local variables for temporaries, and didn't actually get better results. Probably because there are still so many register globals that gcc couldn't pick convenient regs for the temporaries.
// same register globals, except for temp1 and temp2.
void calc_local_tmp() {
int t1 = r*n + k;
sum += src1[t1] * src2[k*n + c];
}
push {lr} # gcc decides to push to get a tmp reg
movw r3, #:lower16:.LANCHOR0 # tmp131,
mla lr, r10, r5, r7 # tmp133, n.1, r, k.2
movt r3, #:upper16:.LANCHOR0 # tmp131,
mla ip, r7, r10, r6 # tmp137, k.2, n.1, c
ldr r2, [r3] # src1, src1
ldr r0, [r3, #4] # src2, src2
ldr r1, [r2, lr, lsl #2] # *_10, *_10
ldr r3, [r0, ip, lsl #2] # *_20, *_20
mla r4, r3, r1, r4 # sum, *_20, *_10, sum
ldr pc, [sp], #4 #
Compiling with -fcall-used-r8 -fcall-used-r9 didn't help; gcc makes the same code that pushes lr to get an extra temporary. It fails to use ldmia (load-multiple) because it makes a sub-optimal choice of which temporary to put in which reg. (&src1 in r0 would let it load src1 and src2 into r2 and r3.)
I wrote a simple test program like this (main.c):
void test(){
}
void main(){
test();
}
Then I used arm-none-eabi-gcc to compile it and objdump to get the assembly code:
arm-none-eabi-gcc -g -fno-defer-pop -fomit-frame-pointer -c main.c
arm-none-eabi-objdump -S main.o > output
The assembly code pushes the r3 and lr registers, even though the function does nothing.
main.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <test>:
void test(){
}
0: e12fff1e bx lr
00000004 <main>:
void main(){
4: e92d4008 push {r3, lr}
test();
8: ebfffffe bl 0 <test>
}
c: e8bd4008 pop {r3, lr}
10: e12fff1e bx lr
My question is: why does ARM gcc choose to push r3 onto the stack, even though test() never uses it? Does gcc just pick a random register to push?
If it's for the stack alignment requirement (8 bytes for ARM), why not just subtract from sp? Thanks.
==================Update==========================
@KemyLand Regarding your answer, I have another example:
The source code is:
void test1(){
}
void test(int i){
test1();
}
void main(){
test(1);
}
I used the same compile command as above and got the following assembly:
main.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <test1>:
void test1(){
}
0: e12fff1e bx lr
00000004 <test>:
void test(int i){
4: e52de004 push {lr} ; (str lr, [sp, #-4]!)
8: e24dd00c sub sp, sp, #12
c: e58d0004 str r0, [sp, #4]
test1();
10: ebfffffe bl 0 <test1>
}
14: e28dd00c add sp, sp, #12
18: e49de004 pop {lr} ; (ldr lr, [sp], #4)
1c: e12fff1e bx lr
00000020 <main>:
void main(){
20: e92d4008 push {r3, lr}
test(1);
24: e3a00001 mov r0, #1
28: ebfffffe bl 4 <test>
}
2c: e8bd4008 pop {r3, lr}
30: e12fff1e bx lr
If push {r3, lr} in the first example is there to use fewer instructions, why doesn't the compiler use just one instruction in test()?
push {r0, lr}
It uses 3 instructions instead of 1:
push {lr}
sub sp, sp, #12
str r0, [sp, #4]
By the way, why does it subtract 12 from sp? The stack is 8-byte aligned, so it could just subtract 4, right?
According to the standard ARM embedded ABI, r0 through r3 are used to pass arguments to a function and to return its value, while lr (a.k.a. r14) is the link register, whose purpose is to hold the return address of a function.
It's obvious that lr must be saved, as otherwise main() would have no way to return to its caller.
It's worth noting that every ARM-mode instruction takes 32 bits and, as you mentioned, ARM requires the call stack to be aligned to 8 bytes. As a bonus, we're using the embedded ARM ABI, so code size should be optimized. Thus, it's more efficient to have a single 32-bit instruction that both saves lr and aligns the stack by pushing an unused register (r3 is not needed, because test() neither takes arguments nor returns anything), and then to pop with a single 32-bit instruction, than to add more instructions (and thus waste precious memory!) to manipulate the stack pointer.
So it's reasonable to conclude that this is just an optimization by GCC.