I am facing a weird issue. I am passing a uint64_t offset to a function on a 32-bit architecture (Cortex-R52), where the registers are all 32 bits wide.
This is the function (sorry, I messed up the function declaration the first time):
void read_func(uint8_t* read_buffer_ptr, uint64_t offset, size_t size);
main(){
// read_buffer : memory where something is read.
// read_buffer_ptr : points to read_buffer structure where value should be stored after reading value.
read_func(read_buffer_ptr, 0, sizeof(read_buffer));
}
In this function, the value stored in offset is not zero but some random value, which I also see in the registers (r5, r6). Also, when I use offset as a 32-bit value, it works perfectly fine. The value is copied from r2, r3 into r5, r6.
Can you please let me know why this could be happening? Are registers not enough?
The prototype posted is invalid; it should be:
void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);
Similarly, the definition of main() is obsolete: the implicit int return type has not been supported since C99, and the function call has another syntax error with a missing )...
What happens when you pass a 64-bit argument on a 32-bit architecture is implementation defined:
either 8 bytes of stack space are used to pass the value
or 2 32-bit registers are used to pass the least significant part and the most significant part
or a combination of both depending on the number of arguments
or some other scheme appropriate for the target CPU
In your code you pass 0, which has type int and presumably has only 32 bits. This is not a problem if the prototype for read_func is correct and has been parsed before the function call; otherwise the behavior is undefined, and a C99 compiler should not even compile the code, but many compilers will just issue a warning and generate bogus code.
In your case (Cortex R52), the 64-bit argument is passed to read_func in registers r2 and r3.
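For reference, a minimal corrected version of the calling code might look like this (the buffer size here is an assumption; only the names come from the comments in the question):

#include <stdint.h>
#include <stddef.h>

void read_func(uint8_t *read_buffer_ptr, uint64_t offset, size_t size);

static uint8_t read_buffer[64];   /* assumed size */

int main(void)
{
    /* With the prototype visible, the int literal 0 is converted to uint64_t.
       Per the AAPCS, read_buffer goes in r0, r1 is skipped so that the 64-bit
       offset starts in an even/odd register pair (r2:r3), and size ends up on
       the stack. */
    read_func(read_buffer, 0, sizeof(read_buffer));
    return 0;
}

Writing the argument as 0ULL or (uint64_t)0 makes the width explicit, but it is not required once the prototype has been seen.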
The Cortex-R52 has a 32-bit address bus, so an offset cannot really be 64 bits: in address calculations only the lower 32 bits are used, as the higher ones have no effect.
example:
uint64_t foo(void *buff, uint64_t offset, uint64_t size)
{
    unsigned char *cbuff = buff;
    while(size--)
    {
        *(cbuff++ + offset) = size & 0xff;
    }
    return offset + (uint32_t)cbuff;
}
void *z1(void);
uint64_t z2(void);
uint64_t z3(void);
uint64_t bar(void)
{
    return foo(z1(), z2(), z3());
}
foo:
push {r4, lr}
ldr lr, [sp, #8] //size
ldr r1, [sp, #12] //size
mov ip, lr
add r4, r0, r2 // cbuff + offset; r3 is ignored as the processor has only a 32-bit address space
.L2:
subs ip, ip, #1 //size--
sbc r1, r1, #0 //size--
cmn r1, #1
cmneq ip, #1
bne .L3
add r0, r0, lr
adds r0, r0, r2
adc r1, r3, #0
pop {r4, pc}
.L3:
strb ip, [r4], #1
b .L2
bar:
push {r0, r1, r4, r5, r6, lr}
bl z1
mov r4, r0 // buff
bl z2
mov r6, r0 // offset
mov r5, r1 // offset
bl z3
mov r2, r6 // offset
strd r0, [sp] // size passed on the stack
mov r3, r5 // offset
mov r0, r4 // buff
bl foo
add sp, sp, #8
pop {r4, r5, r6, pc}
As you see, registers r2 & r3 contain the offset, r0 contains buff, and size is passed on the stack.
I just read https://www.keil.com/support/man/docs/armlink/armlink_pge1406301797482.htm, but I can't understand what the veneer is that the ARM linker inserts between function calls.
In "Procedure Call Standard for the ARM Architecture" document, it says,
5.3.1.1 Use of IP by the linker
Both the ARM- and Thumb-state BL instructions are unable to address the full 32-bit address space, so
it may be necessary for the linker to insert a veneer between the
calling routine and the called subroutine. Veneers may also be needed
to support ARM-Thumb inter-working or dynamic linking. Any veneer
inserted must preserve the contents of all registers except IP (r12)
and the condition code flags; a conforming program must assume that a
veneer that alters IP may be inserted at any branch instruction that
is exposed to a relocation that supports inter-working or long
branches. Note R_ARM_CALL, R_ARM_JUMP24, R_ARM_PC24, R_ARM_THM_CALL,
R_ARM_THM_JUMP24 and R_ARM_THM_JUMP19 are examples of the ELF
relocation types with this property. See [AAELF] for full details
Here is what I guess; is it something like this? When function A calls function B, and the two functions are too far apart for the bl instruction to express, the linker inserts a function C between A and B in such a way that C is close to B. Function A then uses a b instruction to go to function C (preserving all the registers across the call), and function C uses a bl instruction (also preserving all the registers). Of course the r12 register is used to hold the remaining bits of the long jump address. Is this what a veneer is? (I don't know why ARM doesn't explain what a veneer is, only what a veneer provides..)
It is just a trampoline. Interworking is the easier one to demonstrate (using gnu here), but the implication is that Keil has a solution as well.
.globl even_more
.type even_more,%function
even_more:
bx lr
.thumb
.globl more_fun
.thumb_func
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
extern unsigned int even_more ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+even_more(a));
}
Unlinked object:
Disassembly of section .text:
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: ebfffffe bl 0 <more_fun>
c: e1a04000 mov r4, r0
10: e1a00005 mov r0, r5
14: ebfffffe bl 0 <even_more>
18: e0840000 add r0, r4, r0
1c: e8bd4070 pop {r4, r5, r6, lr}
20: e12fff1e bx lr
Linked binary (yes completely unusable, but demonstrates what the tool does)
Disassembly of section .text:
00001000 <fun>:
1000: e92d4070 push {r4, r5, r6, lr}
1004: e1a05000 mov r5, r0
1008: eb000008 bl 1030 <__more_fun_from_arm>
100c: e1a04000 mov r4, r0
1010: e1a00005 mov r0, r5
1014: eb000002 bl 1024 <even_more>
1018: e0840000 add r0, r4, r0
101c: e8bd4070 pop {r4, r5, r6, lr}
1020: e12fff1e bx lr
00001024 <even_more>:
1024: e12fff1e bx lr
00001028 <more_fun>:
1028: 4770 bx lr
102a: 46c0 nop ; (mov r8, r8)
102c: 0000 movs r0, r0
...
00001030 <__more_fun_from_arm>:
1030: e59fc000 ldr r12, [pc] ; 1038 <__more_fun_from_arm+0x8>
1034: e12fff1c bx r12
1038: 00001029 .word 0x00001029
103c: 00000000 .word 0x00000000
You cannot use bl to switch modes between arm and thumb, so the linker has added what I call a trampoline (or have heard it called that): something you hop on and off to reach the destination. In this case it essentially converts the branch part of the bl into a bx; for the link part they take advantage of the bl itself. You can see this done for thumb to arm or arm to thumb.
The even_more function is in the same mode (ARM) so no need for the trampoline/veneer.
For the distance limit of bl lemme see. Wow, that was easy, and gnu called it a veneer as well:
.globl more_fun
.type more_fun,%function
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+1);
}
MEMORY
{
    bob : ORIGIN = 0x00000000, LENGTH = 0x1000
    ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .some : { so.o(.text*) } > bob
    .more : { more.o(.text*) } > ted
}
Disassembly of section .some:
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: eb000003 bl 18 <__more_fun_veneer>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <__more_fun_veneer>:
18: e51ff004 ldr pc, [pc, #-4] ; 1c <__more_fun_veneer+0x4>
1c: 20000000 .word 0x20000000
Disassembly of section .more:
20000000 <more_fun>:
20000000: e12fff1e bx lr
Staying in the same mode, it did not need the bx.
The alternative is to replace every bl instruction at compile time with a more complicated sequence just in case you need to do a far call. Or, since the bl offset/immediate is computed at link time, you can, at link time, insert the trampoline/veneer to change modes or to cover the distance.
You should be able to repeat this yourself with the Keil tools; all you need to do is either switch modes on an external function call or exceed the reach of the bl instruction.
Edit
Understand that toolchains vary, and even within a toolchain: gcc 3.x.x was the first to support thumb, and I do not know that I saw this back then. Note the linker is part of binutils, which is a separate development from gcc. You mention the "arm linker"; well, ARM has its own toolchain, then they bought Keil and perhaps replaced Keil's linker with their own, or not. Then there is gnu and clang/llvm and others. So it is not a case of "the arm linker" doing this or that, it is a case of each toolchain's linker doing this or that, and each toolchain is free, first, to use whatever calling convention it wants (there is no mandate that they use ARM's recommendations), and second, to implement this or not, or simply to give you a warning and leave you to deal with it (likely in assembly language or through function pointers).
ARM does not need to explain it; or let us say, it is clearly explained in the Architectural Reference Manual for a particular architecture (look at the bl instruction and the bx instruction, and look for the word interworking; it is all quite clearly explained), so there is no reason to explain it again. Especially for a generic statement, where the reach of bl varies and each architecture has different interworking features, it would take a long set of paragraphs or a short chapter to explain something that is already clearly documented.
Anyone implementing a compiler and linker will have been well versed in the instruction set beforehand and will understand the limitations of bl, conditional branches, and the rest of the instruction set. Some instruction sets offer near and far jumps, and in some of those the assembly language for near and far uses the same mnemonic, so if the assembler does not see the label in the same file it will often implement a far jump/call rather than a near one so that the objects can be linked.
In any case, before linking you have to compile and assemble, and the toolchain folks will have fully understood the rules of the architecture. ARM is not special here.
This is Raymond Chen's comment :
The veneer has to be close to A because B is too far away. A does a bl
to the veneer, and the veneer sets r12 to the final destination (B) and
does a bx r12. bx can reach the entire address space.
This answers my question well enough, but he doesn't want to write a full answer (maybe for lack of time..), so I am putting it here as an answer and selecting it. If someone posts a better, more detailed answer, I'll switch to it.
Here is the disassembly of code compiled from C:
00799d60 <sub_799d60>:
799d60: b573 push {r0, r1, r4, r5, r6, lr}
799d62: 0004 movs r4, r0
799d64: f000 e854 blx 799e10 <jmp_sub_100C54>
799d68: 4b15 ldr r3, [pc, #84] ; (799dc0 <sub_799d60+0x60>)
799d6a: 0005 movs r5, r0
799d6c: 4668 mov r0, sp
799d6e: 4798 blx r3
The target of the subroutine call (799d6e: 4798 blx r3) takes a 64-bit integer pointer argument and returns a 64-bit integer. That routine is a library function, so I am not able to make any modifications to it.
Could this operation overwrite the stack slots that hold the lr and r6 values?
You say that the branch target "takes a 64 bit integer pointer argument and returns a 64 bit integer", but this is not the case. It takes a pointer to a 64-bit integer as its only argument (and this pointer is 32 bits long unless you're on aarch64, which I doubt given the rest of the code); and it returns nothing, it simply overwrites the 64-bit value pointed to by the argument you passed in. I'm sure this is what you meant, but be careful with your use of terminology, because the difference between these things is important! In particular, there is no 64-bit argument passed either into or out of the function you're calling.
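In C terms the distinction is roughly this (the function names here are made up, since the real one isn't shown):

void lib_get_value(uint64_t *out);          /* what the code actually calls: a 32-bit pointer in r0, nothing returned */
uint64_t lib_get_value_alt(uint64_t *out);  /* what the wording suggested: this shape would also return a 64-bit value in r0:r1 */

Only the second shape would involve a 64-bit value passing through registers; the first does all of its 64-bit work through memory.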
On to the question itself. The key to understanding what the compiler is doing here is to look at the very first line:
push {r0, r1, r4, r5, r6, lr}
The ARM calling convention doesn't require r0 and r1 to be call-preserved, so what are they doing in the list? The answer is that the compiler has added these 'dummy' pushes to create some space on the stack. The push operation above is essentially equivalent to
push {r4, r5, r6, lr}
sub sp, sp, #0x08
except that it saves an instruction. The result is not quite the same, of course, because whatever was in r0 and r1 ends up being written to these locations; but given that there's no way to know what was there beforehand, and the stacked values are about to get overwritten anyway, it's of no consequence. So we have, as a stack frame,
lr
r6
r5
r4
(r1)
sp -> (r0)
with the stack pointer pointing at the space created by the dummy push of r0 and r1. Now we just have
mov r0, sp
which copies the stack pointer to r0 to use as the pointer argument to the function you're calling, which will then overwrite the two words at this location to result in a stack frame of
lr
r6
r5
r4
(64-bit value, high word)
sp -> (64-bit value, low word)
You haven't shown any code beyond the blx r3, so it's not possible to say exactly what happens to the stack at the end of the function. But if this function returns no value, I would expect to see a matching
pop {r0, r1, r4, r5, r6, pc}
which will, of course, leave your 64-bit value in r0 and r1. But these registers are call-clobbered according to the calling convention, so there's no problem.
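Putting it together, a plausible C-level reconstruction of the caller (a sketch only; the names are hypothetical and the real code certainly differs in its details) would be:

#include <stdint.h>

void lib_get_value(uint64_t *out);   /* the library routine reached via blx r3, with the assumed signature */

void caller(void)
{
    uint64_t value;              /* occupies the two stack slots created by the dummy push of r0/r1 */
    lib_get_value(&value);       /* the 'mov r0, sp' passes this address */
    /* value now holds the 64-bit result; lr and r4-r6 sit above it in the frame, untouched */
}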
I am trying to write an ARM program that takes three numbers and calculates the discriminant. It has two source files, driver.s & prog3.s. I understand how to find the discriminant, but how do I pass the values A, B, & C into the discrim function from the main function? I have included the code I have typed thus far.
MAIN() driver.s
avalue .req r0
bvalue .req r1
cvalue .req r2
final .req r3
loopcount .req r4
readA:
.ascii "%d"
readB:
.ascii "%d"
readC:
.ascii "%d"
addressReadA: .word readA
addressReadB: .word readB
addressReadC: .word readC
main:
ldr avalue, addressReadA # load in avalue
ldr bvalue, addressReadB # load in bvalue
ldr cvalue, addressReadC # load in cvalue
DISCRIM() prog3.s
avalue .req r0
bvalue .req r1
cvalue .req r2
final .req r3
discrim:
mul bvalue, bvalue, bvalue # square bvalue
mul avalue, avalue, #4 # multiply avalue by 4
mul cvalue, avalue, cvalue # multiply avalue by cvalue
add final, bvalue, cvalue # calculated discriminant
Going with the calling convention that C compilers use is not a bad idea, especially since, if you later move from pure assembly programs to mixed C and asm, you will already have that experience. And you may come to see the simplicity and wisdom in the calling conventions used.
How do you know what the calling convention for a compiler is? 1) Read the manual/documentation and google. 2) Just try it: prototype a function that is similar in the number and type of operands and the return value, feed it real-ish numbers, and see what it produces.
Compiling to asm sometimes works, but with pseudo-instructions and other things done by the assembler I prefer to disassemble rather than compile to asm; YMMV.
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c );
unsigned int test ( void )
{
return(fun(1,2,3));
}
which with gnu currently results in
00000000 <test>:
0: e92d4010 push {r4, lr}
4: e3a02003 mov r2, #3
8: e3a01002 mov r1, #2
c: e3a00001 mov r0, #1
10: ebfffffe bl 0 <fun>
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Each combination of compiler and target may have a different calling convention; there is no reason to assume that different compilers, or different versions of the same compiler, use the same convention. ARM, MIPS, and no doubt others try to help/encourage/suggest a calling convention to use, and some compilers simply follow that; why not.
There are lots of exceptions to the rules in the convention, but for ARM, for the first four registers' worth of parameters (in this case up to four signed or unsigned integers, or up to four quantities of 32 bits or less; float can create exceptions), the first four general-purpose registers are used: r0 for the first parameter, r1 for the second, and so on. And currently the standard keeps the stack aligned on 64-bit boundaries.
So we see that the first parameter is indeed placed in r0, the second in r1, and the third in r2; obviously you don't have to arrange those three instructions in that order, it doesn't matter.
Because this function calls another function, it has to preserve its return address, which is in lr, so that goes on the stack. Because the standard says to keep the stack aligned on 64-bit boundaries, another register is pushed as well; r4 is arbitrary, it could have been any register, this is just the one the tool chose.
Because the standard says to return in r0, here is code that implements one of these functions:
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c )
{
return(a+b^c);
}
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e0200002 eor r0, r0, r2
8: e12fff1e bx lr
It is very interesting, now that I see this, that the compiler did not do a tail-call optimization here: it could have skipped saving lr and simply branched to fun, since the return value in r0 is what test() also returns in the same register. I am really kind of baffled that that didn't happen.
But you can see that indeed the return value is left in r0, and per the convention we can trash r0-r3; we don't have to preserve them, and these functions don't.
if you change test to this
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c );
unsigned int test ( void )
{
return(fun(1,2,3)+7);
}
Then it can't tail-optimize, and it also shows the return register, so you don't have to create a fun() function to see it:
00000000 <test>:
0: e92d4010 push {r4, lr}
4: e3a02003 mov r2, #3
8: e3a01002 mov r1, #2
c: e3a00001 mov r0, #1
10: ebfffffe bl 0 <fun>
14: e8bd4010 pop {r4, lr}
18: e2800007 add r0, r0, #7
1c: e12fff1e bx lr
you can do this kind of thing with other targets or other compilers, and there is no reason to assume that one target has the same convention as another.
Disassembly of section .text:
00000000 <fun>:
0: 0f 5e add r14, r15
2: 0f ed xor r13, r15
4: 30 41 ret
0000000000000000 <fun>:
0: 8d 04 37 lea (%rdi,%rsi,1),%eax
3: 31 d0 xor %edx,%eax
5: c3 retq
and this one is stack based instead of register based
Disassembly of section .text:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d41 0004 mov 4(r5), r1
8: 6d41 0006 add 6(r5), r1
c: 1d40 0008 mov 10(r5), r0
10: 7840 xor r1, r0
12: 1585 mov (sp)+, r5
14: 0087 rts pc
But if this is just a pure assembly project and you don't have to interface with compiled output, do whatever you want. Part of designing the project is not just each individual function but how they interact; no different from C or Python or some other language, you still have to define the interface between your functions yourself. Assembly doesn't make that special or different, it is just another language.
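Applied to the question, a minimal prototype to compile and disassemble (a sketch, assuming plain int inputs and the usual b*b - 4*a*c definition of the discriminant) would be:

int discrim(int a, int b, int c)
{
    return b * b - 4 * a * c;    /* a arrives in r0, b in r1, c in r2; the result goes back in r0 */
}

int driver(void)
{
    return discrim(1, 2, 3);     /* the caller loads 1, 2, 3 into r0-r2 and does a bl to discrim */
}

The disassembly of that will show exactly where driver.s has to place A, B and C before the bl, and where the result comes back.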
I tried to write some simple test code like this:
main.c
void test(){
}
void main(){
test();
}
Then I used arm-none-eabi-gcc to compile it and objdump to get the assembly code:
arm-none-eabi-gcc -g -fno-defer-pop -fomit-frame-pointer -c main.c
arm-none-eabi-objdump -S main.o > output
The assembly code pushes the r3 and lr registers, even though the function does nothing:
main.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <test>:
void test(){
}
0: e12fff1e bx lr
00000004 <main>:
void main(){
4: e92d4008 push {r3, lr}
test();
8: ebfffffe bl 0 <test>
}
c: e8bd4008 pop {r3, lr}
10: e12fff1e bx lr
My question is: why does arm gcc choose to push r3 onto the stack, even though the test() function never uses it? Does gcc just randomly choose one register to push?
If it's for the stack-alignment requirement (8 bytes for ARM), why not just subtract from sp? Thanks.
==================Update==========================
@KemyLand For your answer, I have another example:
The source code is:
void test1(){
}
void test(int i){
test1();
}
void main(){
test(1);
}
I use the same compile commands as above, and get the following assembly:
main.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <test1>:
void test1(){
}
0: e12fff1e bx lr
00000004 <test>:
void test(int i){
4: e52de004 push {lr} ; (str lr, [sp, #-4]!)
8: e24dd00c sub sp, sp, #12
c: e58d0004 str r0, [sp, #4]
test1();
10: ebfffffe bl 0 <test1>
}
14: e28dd00c add sp, sp, #12
18: e49de004 pop {lr} ; (ldr lr, [sp], #4)
1c: e12fff1e bx lr
00000020 <main>:
void main(){
20: e92d4008 push {r3, lr}
test(1);
24: e3a00001 mov r0, #1
28: ebfffffe bl 4 <test>
}
2c: e8bd4008 pop {r3, lr}
30: e12fff1e bx lr
If push {r3, lr} in the first example is there to use fewer instructions, why didn't the compiler just use one instruction in this function test(), i.e.
push {r0, lr}
Instead it uses 3 instructions:
push {lr}
sub sp, sp, #12
str r0, [sp, #4]
By the way, why does it subtract 12 from sp? The stack is 8-byte aligned, so it could just subtract 4, right?
According to the Standard ARM Embedded ABI, r0 through r3 are used to pass arguments to a function and to return its value, while lr (a.k.a. r14) is the link register, whose purpose is to hold the return address of a function.
It's obvious that lr must be saved, as otherwise main() would have no way to return to its caller.
It is also worth mentioning that every ARM-state instruction is 32 bits long and, as you noted, ARM has a call-stack alignment requirement of 8 bytes. And, as a bonus, we're using the Embedded ARM ABI, so code size should be optimized. Thus it's more efficient to have a single 32-bit instruction both save lr and align the stack by pushing an unused register (r3 is not needed, because test() neither takes arguments nor returns anything), and then restore everything with a single 32-bit pop, rather than adding more instructions (and thus wasting precious memory!) to manipulate the stack pointer.
After all, it's pretty logical to conclude this is just an optimization from GCC.