Are C compilers like gcc smart enough to bit shifts in place?

Are C compilers like gcc smart enough to bit shifts in place? - c

Unlike assembly code, in C there is no way to bit shift a value in place. To shift the bits in variable an assignment must always be performed:
x = x << 3;
Are compilers like gcc smart enough to realize that this is an in-place bit shift and compile it like this:
shl x, 3
or will the compiler put the result first in a register, then move it back into x (which would require two extra unnecessary instructions).

Any good compiler with optimization turned on will handle bit shifts efficiently.
Compilers will keep small objects in registers when feasible and efficient and will not store them to memory even if you write assignment statements, until they are forced to by circumstances.
Additionally, it is not desirable on typical modern processors to try to shift the bits of a value in memory. Generally, memory hardware does not have any capability to manipulate stored values. To change the value of something in memory, it must be moved to the processor (loaded), changed, and moved back (stored). Whether this is done in one instruction or several is not generally an indication of how fast or efficient it is, because the processor still has to do the individual load, shift, store operations, and the performance of those is highly dependent on the processor model.
Except in exceptional programming situations, you should not be worrying about performance at this level.

what did you see when you tried it? why not just try it?
unsigned int fun ( unsigned int x )
{
return (x<<3);
}
Disassembly of section .text:
00000000 <fun>:
0: e1a00180 lsl r0, r0, #3
4: e12fff1e bx lr
Disassembly of section .text:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 0cc0 asl r0
a: 0cc0 asl r0
c: 0cc0 asl r0
e: 1585 mov (sp)+, r5
10: 0087 rts pc
Disassembly of section .text:
0000000000000000 <fun>:
0: 531d7000 lsl w0, w0, #3
4: d65f03c0 ret
Disassembly of section .text:
0000000000000000 <fun>:
0: 8d 04 fd 00 00 00 00 lea 0x0(,%rdi,8),%eax
7: c3 retq
00000000 <fun>:
0: 42 18 0c 5c rpt #3 { rlax.w r12 ;
4: 30 41 ret
Disassembly of section .text:
00000000 <fun>:
0: 050e slli x10,x10,0x3
2: 8082 ret
unsigned int x;
void fun ( void )
{
x=x<<3;
}
Disassembly of section .text:
00000000 <fun>:
0: e59f200c ldr r2, [pc, #12] ; 14 <fun+0x14>
4: e5923000 ldr r3, [r2]
8: e1a03183 lsl r3, r3, #3
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
and so on

Related

In the ARM ABI, how are global variables accessed?

I am writing a simple multitasking OS for the ARM Cortex M3. My threads always run using the Process Stack Pointer. I have an application that I inherited and that uses global variables. I am trying to call the functions in that application from my threading code but it is not accessing memory correctly. Are the following statements correct:
Those global variables are accessed via some kind of relative addressing, and that relative address is placed on the Main stack (using MSP)?
My threading code, using PSP, will never be able to access them
I need to switch to MSP when calling these functions, then back to PSP when using my threads?
**EDIT: Clarified that this is for a Cortex M

Global variables have nothing to do with the stack, even static locals.
So you need to just look at the output of the compiler, it will tell you everything.
Your question is very vague you could be asking one of many different questions. I will show some basics and maybe I will get lucky.
Note that this should in general have nothing to do with the processor, mode, etc. arm, thumb, x86, whatever. Much more to do with the toolchain.
If this is too basic and you are asking some very advanced question it is not obvious to me I will delete or rewrite, no problem.
Throwaway code is always a good idea to figure things out.
flash.s
.thumb
.syntax unified
.word 0x20001000
.word reset
.thumb_func
reset:
bl notmain
b .
notmain.c
unsigned int x;
unsigned int y=5;
void notmain ( void )
{
unsigned int z=7;
x=++y;
z--;
}
flash.ld
MEMORY
{
rom : ORIGIN = 0x00080000, LENGTH = 0x00001000
ram : ORIGIN = 0x20000000, LENGTH = 0x00001000
}
SECTIONS
{
.text : { *(.text) } > rom
.bss : { *(.bss) } > ram
.data : { *(.data) } > ram
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m0 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -c notmain.c -o notmain.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o -o flash.elf
arm-none-eabi-objdump -D flash.elf > flash.list
arm-none-eabi-objcopy -O binary flash.elf flash.bin
examine
Disassembly of section .text:
00080000 <reset-0x8>:
80000: 20001000 andcs r1, r0, r0
80004: 00080009 andeq r0, r8, r9
00080008 <reset>:
80008: f000 f802 bl 80010 <notmain>
8000c: e7fe b.n 8000c <reset+0x4>
...
00080010 <notmain>:
80010: 4b04 ldr r3, [pc, #16] ; (80024 <notmain+0x14>)
80012: 4905 ldr r1, [pc, #20] ; (80028 <notmain+0x18>)
80014: 681a ldr r2, [r3, #0]
80016: 3201 adds r2, #1
80018: 601a str r2, [r3, #0]
8001a: 600a str r2, [r1, #0]
8001c: 685a ldr r2, [r3, #4]
8001e: 3a01 subs r2, #1
80020: 605a str r2, [r3, #4]
80022: 4770 bx lr
80024: 20000004 andcs r0, r0, r4
80028: 20000000 andcs r0, r0, r0
Disassembly of section .bss:
20000000 <x>:
20000000: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000004 <y>:
20000004: 00000005 andeq r0, r0, r5
20000008 <z.3645>:
20000008: 00000007 andeq r0, r0, r7
This is basic not relocatable, etc.
80010: 4b04 ldr r3, [pc, #16] ; (80024 <notmain+0x14>)
80014: 681a ldr r2, [r3, #0]
80016: 3201 adds r2, #1
80018: 601a str r2, [r3, #0]
80024: 20000004 andcs r0, r0, r4
Disassembly of section .data:
20000004 <y>:
20000004: 00000005 andeq r0, r0, r5
We can see the y++. r3 gets the address to y, r2 gets the value of y
r2 increments, and then is saved back to memory.
And you can see how x and z are handled as well.
Now this cannot work for an mcu for a couple of reasons. The 0x20000000
address information will not be there. Only what is in non-volatile storage
will be there when the chip powers up and comes out of reset. The above is relevant depending on what your real question is.
MEMORY
{
rom : ORIGIN = 0x00080000, LENGTH = 0x00001000
ram : ORIGIN = 0x20000000, LENGTH = 0x00001000
}
SECTIONS
{
.text : { *(.text) } > rom
.bss : { *(.bss) } > ram AT > rom
.data : { *(.data) } > ram AT > rom
}
The program does not change, but the binary does
00000000 00 10 00 20 09 00 08 00 00 f0 02 f8 fe e7 00 00 |... ............|
00000010 04 4b 05 49 1a 68 01 32 1a 60 0a 60 5a 68 01 3a |.K.I.h.2.`.`Zh.:|
00000020 5a 60 70 47 04 00 00 20 00 00 00 20 05 00 00 00 |Z`pG... ... ....|
00000030 07 00 00 00 |....|
00000034
At 0x2C we see the preload value for y and at 0x30 for z.
The .bss value is not located here. Normally what you do is add a whole lot
more stuff to the linker script to get the addresses of things. Data start and stop, and bss start and size or stop. Then a bootstrap that copies from flash to ram so that the initialized values are in ram and the read/write works.
So if your project, call it an operating system or not, is just one large body of code that is compiled and linked all together. Then without doing special things like lots of sections or something. The above is what you are looking at and the stack is not related to globals. Because it never is normally.
(msp/psp does not work the way arm implies they do, I have yet to see a use case for the second stack pointer, IF the processor even has it they do not all have it implemented)
Now if your threads are actually separately built programs that you load runtime...Then they completely live in ram. So
MEMORY
{
rom : ORIGIN = 0x00080000, LENGTH = 0x00001000
ram : ORIGIN = 0x20000000, LENGTH = 0x00001000
}
SECTIONS
{
.text : { *(.text) } > ram
.bss : { *(.bss) } > ram
.data : { *(.data) } > ram
}
and we add -fPIC
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -fPIC -c notmain.c -o notmain.o
Disassembly of section .text:
20000000 <reset-0x8>:
20000000: 20001000 andcs r1, r0, r0
20000004: 20000009 andcs r0, r0, r9
20000008 <reset>:
20000008: f000 f802 bl 20000010 <notmain>
2000000c: e7fe b.n 2000000c <reset+0x4>
...
20000010 <notmain>:
20000010: 4a07 ldr r2, [pc, #28] ; (20000030 <notmain+0x20>)
20000012: 4b08 ldr r3, [pc, #32] ; (20000034 <notmain+0x24>)
20000014: 447a add r2, pc
20000016: 58d1 ldr r1, [r2, r3]
20000018: 680b ldr r3, [r1, #0]
2000001a: 3301 adds r3, #1
2000001c: 600b str r3, [r1, #0]
2000001e: 4906 ldr r1, [pc, #24] ; (20000038 <notmain+0x28>)
20000020: 5852 ldr r2, [r2, r1]
20000022: 6013 str r3, [r2, #0]
20000024: 4a05 ldr r2, [pc, #20] ; (2000003c <notmain+0x2c>)
20000026: 447a add r2, pc
20000028: 6813 ldr r3, [r2, #0]
2000002a: 3b01 subs r3, #1
2000002c: 6013 str r3, [r2, #0]
2000002e: 4770 bx lr
20000030: 00000034 andeq r0, r0, r4, lsr r0
20000034: 00000004 andeq r0, r0, r4
20000038: 00000000 andeq r0, r0, r0
2000003c: 0000001a andeq r0, r0, sl, lsl r0
Disassembly of section .bss:
20000040 <x>:
20000040: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000044 <z.3645>:
20000044: 00000007 andeq r0, r0, r7
20000048 <y>:
20000048: 00000005 andeq r0, r0, r5
Disassembly of section .got:
2000004c <.got>:
2000004c: 20000040 andcs r0, r0, r0, asr #32
20000050: 20000048 andcs r0, r0, r8, asr #32
Disassembly of section .got.plt:
20000054 <_GLOBAL_OFFSET_TABLE_>:
...
Because you may need to be able to load the program anywhere in ram (within rules).
The code is all relative, but the data because of the nature of compiling and linking needs some hardcoding. So they setup a global offset table GOT. The location of the got is relative to the code, you cannot change that.
20000010: 4a07 ldr r2, [pc, #28] ; (20000030 <notmain+0x20>)
20000012: 4b08 ldr r3, [pc, #32] ; (20000034 <notmain+0x24>)
20000014: 447a add r2, pc
20000016: 58d1 ldr r1, [r2, r3]
20000018: 680b ldr r3, [r1, #0]
2000001a: 3301 adds r3, #1
2000001c: 600b str r3, [r1, #0]
There is your y++ when built position independent.
r2 gets an offset, r3 gets another offset. r2 is the relative offset to
the got from the code, (you cannot separate them and move one around and not
the other, not what position independent means) so now r2 points to the
GOT. r3 is the offset in the GOT to the address of y. r1 gets the address
of y and now it is like before get y in r3, add one, save y to memory.
Now IF you were to relocate this to an address that is not 0x20000000 your
bootstrap needs to go to the GOT and patch up all the addresses so you need
linker magic to get where the got is and how bit it is, etc...Use the pc to
figure out where you are and then make the adjustments. If loaded into memory at 0x20002000 then you need to add 0x2000 to each of the entries
in the table and then it will all just work. (still no stack stuff, stack is not related).
A little trick if you have the space.
Notice I put bss before data, and I have at least one .data item. If you can guarantee that (force a .data in your bootstrap for example).
00000000 00 10 00 20 09 00 00 20 00 f0 02 f8 fe e7 00 00 |... ... ........|
00000010 07 4a 08 4b 7a 44 d1 58 0b 68 01 33 0b 60 06 49 |.J.KzD.X.h.3.`.I|
00000020 52 58 13 60 05 4a 7a 44 13 68 01 3b 13 60 70 47 |RX.`.JzD.h.;.`pG|
00000030 34 00 00 00 04 00 00 00 00 00 00 00 1a 00 00 00 |4...............|
00000040 00 00 00 00 07 00 00 00 05 00 00 00 40 00 00 20 |............#.. |
00000050 48 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 |H.. ............|
00000060
20000040 <x>:
20000040: 00000000 andeq r0, r0, r0
Objdump pads the binary for a -O binary with zeros for .bss If you put it last then it is not assumed to work.
So I do not know how this code you have uses threads and globals, does it try to keep variables specific to each thread? If so does it use static locals up front then pass the address on the stack (and even there the stack pointer you use does not matter unless you are not properly using the stack in general, if not then globals are not your problem.).
If you start off the thread or any code on one stack pointer and implying
completely separate stacks (memory address spaces). And then switch, abandoning stack information needed for the code to work in and out of
functions, and then if you return from functions after switching stacks all
the code would break not just pointers to static locals that are passed along.
So a minimal example that demonstrates the problem can confirm for us what is really going on and what your questions really are and what the problem is. If you want to use the two stack pointers for a cortex-m you need to carefully read up and you need to also write some throwaway code examples to see how it works, and then apply that to the code the tools are generating.
Again if this is too elementary and I am miles away from the real question, I will certainly delete this no problem.

What is 'veneer' that arm linker uses in function call?

I just read https://www.keil.com/support/man/docs/armlink/armlink_pge1406301797482.htm. but can't understand what a veneer is that arm linker inserts between function calls.
In "Procedure Call Standard for the ARM Architecture" document, it says,
5.3.1.1 Use of IP by the linker Both the ARM- and Thumb-state BL instructions are unable to address the full 32-bit address space, so
it may be necessary for the linker to insert a veneer between the
calling routine and the called subroutine. Veneers may also be needed
to support ARM-Thumb inter-working or dynamic linking. Any veneer
inserted must preserve the contents of all registers except IP (r12)
and the condition code flags; a conforming program must assume that a
veneer that alters IP may be inserted at any branch instruction that
is exposed to a relocation that supports inter-working or long
branches. Note R_ARM_CALL, R_ARM_JUMP24, R_ARM_PC24, R_ARM_THM_CALL,
R_ARM_THM_JUMP24 and R_ARM_THM_JUMP19 are examples of the ELF
relocation types with this property. See [AAELF] for full details
Here is what I guess, is it something like this ? : when function A calls function B, and when those two functions are too far apart for the bl command to express, the linker inserts function C between function A and B in such a way function C is close to function B. Now function A uses b instruction to go to function C(copying all the registers between the function call), and function C uses bl instruction(copying all the registers too). Of course the r12 register is used to keep the remaining long jump address bits. Is this what veneer means? (I don't know why arm doesn't explain what veneer is but only what veneer provides..)

It is just a trampoline. Interworking is the easier one to demonstrate, using gnu here, but the implication is that Kiel has a solution as well.
.globl even_more
.type eve_more,%function
even_more:
bx lr
.thumb
.globl more_fun
.thumb_func
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
extern unsigned int even_more ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+even_more(a));
}
Unlinked object:
Disassembly of section .text:
00000000 <fun>:
0: e92d4070 push {r4, r5, r6, lr}
4: e1a05000 mov r5, r0
8: ebfffffe bl 0 <more_fun>
c: e1a04000 mov r4, r0
10: e1a00005 mov r0, r5
14: ebfffffe bl 0 <even_more>
18: e0840000 add r0, r4, r0
1c: e8bd4070 pop {r4, r5, r6, lr}
20: e12fff1e bx lr
Linked binary (yes completely unusable, but demonstrates what the tool does)
Disassembly of section .text:
00001000 <fun>:
1000: e92d4070 push {r4, r5, r6, lr}
1004: e1a05000 mov r5, r0
1008: eb000008 bl 1030 <__more_fun_from_arm>
100c: e1a04000 mov r4, r0
1010: e1a00005 mov r0, r5
1014: eb000002 bl 1024 <even_more>
1018: e0840000 add r0, r4, r0
101c: e8bd4070 pop {r4, r5, r6, lr}
1020: e12fff1e bx lr
00001024 <even_more>:
1024: e12fff1e bx lr
00001028 <more_fun>:
1028: 4770 bx lr
102a: 46c0 nop ; (mov r8, r8)
102c: 0000 movs r0, r0
...
00001030 <__more_fun_from_arm>:
1030: e59fc000 ldr r12, [pc] ; 1038 <__more_fun_from_arm+0x8>
1034: e12fff1c bx r12
1038: 00001029 .word 0x00001029
103c: 00000000 .word 0x00000000
You cannot use bl to switch modes between arm and thumb so the linker has added a trampoline as I call it or have heard it called that you hop on and off to get to the destination. In this case essentially converting the branch part of bl into a bx, the link part they take advantage of just using the bl. You can see this done for thumb to arm or arm to thumb.
The even_more function is in the same mode (ARM) so no need for the trampoline/veneer.
For the distance limit of bl lemme see. Wow, that was easy, and gnu called it a veneer as well:
.globl more_fun
.type more_fun,%function
more_fun:
bx lr
extern unsigned int more_fun ( unsigned int x );
unsigned int fun ( unsigned int a )
{
return(more_fun(a)+1);
}
MEMORY
{
bob : ORIGIN = 0x00000000, LENGTH = 0x1000
ted : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
.some : { so.o(.text*) } > bob
.more : { more.o(.text*) } > ted
}
Disassembly of section .some:
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: eb000003 bl 18 <__more_fun_veneer>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <__more_fun_veneer>:
18: e51ff004 ldr pc, [pc, #-4] ; 1c <__more_fun_veneer+0x4>
1c: 20000000 .word 0x20000000
Disassembly of section .more:
20000000 <more_fun>:
20000000: e12fff1e bx lr
Staying in the same mode it did not need the bx.
The alternative is that you replace every bl instruction at compile time with a more complicated solution just in case you need to do a far call. Or since the bl offset/immediate is computed at link time you can, at link time, put the trampoline/veneer in to change modes or cover the distance.
You should be able to repeat this yourself with Kiel tools, all you needed to do was either switch modes on an external function call or exceed the reach of the bl instruction.
Edit
Understand that toolchains vary and even within a toolchain, gcc 3.x.x was the first to support thumb and I do not know that I saw this back then. Note the linker is part of binutils which is as separate development from gcc. You mention "arm linker", well arm has its own toolchain, then they bought Kiel and perhaps replaced Kiel's with their own or not. Then there is gnu and clang/llvm and others. So it is not a case of "arm linker" doing this or that, it is a case of the toolchains linker doing this or that and each toolchain is first free to use whatever calling convention they want there is no mandate that they have to use ARM's recommendations, second they can choose to implement this or not or simply give you a warning and you have to deal with it (likely in assembly language or through function pointers).
ARM does not need to explain it, or let us say, it is clearly explained in the Architectural Reference Manual (look at the bl instruction, the bx instruction look for the words interworking, etc. All quite clearly explained) for a particular architecture. So there is no reason to explain it again. Especially for a generic statement where the reach of bl varies and each architecture has different interworking features, it would be a long set of paragraphs or a short chapter to explain something that is already clearly documented.
Anyone implementing a compiler and linker would be well versed in the instruction set before hand and understand the bl and conditional branch and other limitations of the instruction set. Some instruction sets offer near and far jumps and some of those the assembly language for the near and far may be the same mnemonic so the assembler will often decide if it does not see the label in the same file to implement a far jump/call rather than a near one so that the objects can be linked.
In any case before linking you have to compile and assembly and the toolchain folks will have fully understood the rules of the architecture. ARM is not special here.

This is Raymond Chen's comment :
The veneer has to be close to A because B is too far away. A does a bl
to the veneer, and the veneer sets r12 to the final destination(B) and
does a bx r12. bx can reach the entire address space.
This answers to my question enough, but he doesn't want to write a full answer (maybe for lack of time..) I put it here as an answer and select it. If someone posts a better, more detailed answer, I'll switch to it.

Difference between position dependent and position independent code?

I understand that the current gcc compilers by default generate position independent code. However, to get an understanding of how position dependent code looked like, I compiled this
int Add(int x, int y) {
return x+y;
}
int Subtract(int x, int y) {
return x-y;
}
int main() {
bool flag = false;
int x=10,y=5,z;
if (flag) {
z = Add(x,y);
}
else {
z = Subtract(x,y);
}
}
as g++ -c check.cpp -no-pie. However, the generated code is identical with or without the -no-pie flag. <main+0x34> looks to be a relative offset.
26: 55 push %rbp
27: 48 89 e5 mov %rsp,%rbp
2a: 48 83 ec 10 sub $0x10,%rsp
2e: c6 45 f3 00 movb $0x0,-0xd(%rbp)
32: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
39: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
40: 80 7d f3 00 cmpb $0x0,-0xd(%rbp)
44: 74 14 je 5a <main+0x34>
46: 8b 55 f8 mov -0x8(%rbp),%edx
49: 8b 45 f4 mov -0xc(%rbp),%eax
4c: 89 d6 mov %edx,%esi
4e: 89 c7 mov %eax,%edi
50: e8 00 00 00 00 callq 55 <main+0x2f>
55: 89 45 fc mov %eax,-0x4(%rbp)
58: eb 12 jmp 6c <main+0x46>
5a: 8b 55 f8 mov -0x8(%rbp),%edx
5d: 8b 45 f4 mov -0xc(%rbp),%eax
60: 89 d6 mov %edx,%esi
62: 89 c7 mov %eax,%edi
64: e8 00 00 00 00 callq 69 <main+0x43>
69: 89 45 fc mov %eax,-0x4(%rbp)
6c: b8 00 00 00 00 mov $0x0,%eax
71: c9 leaveq
72: c3 retq
is the objdump in both cases for just the main. Am I not using the correct flag or is the assembly code supposed to be same for PIC and non-PIC for this code chunk. If it is supposed to be the same, could you please provide a snippet for which it isn't!

You have to access items that are outside the module or section to see a difference.
unsigned int x;
void fun ( void )
{
x = 5;
}
so this crosses over .text to .data.
position dependent.
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000
position independent
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f3010 ldr r3, [pc, #16] ; 1c <fun+0x1c>
8: e59f1010 ldr r1, [pc, #16] ; 20 <fun+0x20>
c: e08f3003 add r3, pc, r3
10: e7933001 ldr r3, [r3, r1]
14: e5832000 str r2, [r3]
18: e12fff1e bx lr
1c: 00000008
20: 00000000
In the first case the linker will fill in the address to the memory location
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000 <--- here
the pc relative addressing from 4: to 10: is within the .text section so dependent or independent are fine.
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000
it gets the address to the external entity, filled in by the linker, and then directly access that item at that address.
4: e59f3010 ldr r3, [pc, #16] ; 1c <fun+0x1c>
8: e59f1010 ldr r1, [pc, #16] ; 20 <fun+0x20>
c: e08f3003 add r3, pc, r3
10: e7933001 ldr r3, [r3, r1]
14: e5832000 str r2, [r3]
18: e12fff1e bx lr
1c: 00000008
20: 00000000
is easier to see linked (-Ttext=0x1000 -Tdata=0x2000)
00001000 <fun>:
1000: e3a02005 mov r2, #5
1004: e59f3010 ldr r3, [pc, #16] ; 101c <fun+0x1c>
1008: e59f1010 ldr r1, [pc, #16] ; 1020 <fun+0x20>
100c: e08f3003 add r3, pc, r3
1010: e7933001 ldr r3, [r3, r1]
1014: e5832000 str r2, [r3]
1018: e12fff1e bx lr
101c: 00010010
1020: 0000000c
Disassembly of section .got:
00011024 <_GLOBAL_OFFSET_TABLE_>:
...
11030: 00002000
Disassembly of section .bss:
00002000 <x>:
2000: 00000000
(clearly I should have also specified where the GOT goes).
While the global offset table and .bss are different sections once linked they are fixed relative to each other. What position independence gives is the ability to move .bss (or .data, etc) relative to .text. So if you think about the position dependent solution, if .data were to move and you had say 1000 references sprinkled all through the binary, in order to move .bss you would have to patch every one of those.
Instead the global offset table here provides a single location where the address of the variable x lives, and all access to variable x will essentially use double indirection to access. It may not be obvious but a position dependent way to get at a table like this would be for the linker to fill in its absolute address, but that would not be independent and this was compiled to be independent so pc relative math has to be done to find the global offset table, so for this instruction set when executing the instruction at 0x100c the program counter is 0x100c+8.
100c: e08f3003 add r3, pc, r3
So we are adding 0x100C+8+0x00010010 = 0x11024 and adding 0x0000000c to that giving 0x11030. So compute the address to the GOT then the offset within that, and THAT gives us the address to the item. 0x2000. So you do the second indirection there to get at the item.
If you were to place .text at an address other than 0x1000 but don't move .bss that is fine this will all work so long that the GOT moves to the same relative offset from .text. If you were to leave .text but move .bss then you have to update the GOT, if you move .bss from 0x2000 to 0x3000 then that is a difference of +0x1000 so you then go through the GOT and add 0x1000 to each item to cover that difference.
Position independence essentially has to do double indirection instead of single indirection (or one more level than would have been needed for position dependent) in order to access distant items or items not position dependent relative to .text. Which means more code, more memory access. It is more code and slower.
For it to work .text reaching out to other .text items cant use fixed addresses it has to use indirect/computed addresses. Likewise the GOT as used here (by GNU) has to be at a fixed relative position to .text. Then from there you can move data relative to code and still access it. So you have to have some rules. .text being code and assumed read only cant support this offset table which needs to be in ram, so it cant simply be built into the .text section.

ARM Parameter Passing

I am trying to write an ARM program that takes three numbers and calculates the discriminant. It has two source files, driver.s & prog3.s. I understand how to find the discriminate, but how do I pass the values A, B, & C into the discrim function from the main function? I have included the code I typed thus far....
MAIN() driver.s
avalue .reg r0
bvalue .req r1
cvalue .req r2
final .req r3
loopcount .req r4
readA:
.ascii “%d”
readB:
.ascii “%d”
readC:
.ascii “%d”
addressReadA: .word readA
addressReadB: .word readB
addressReadC: .word readC
main:
ldr avalue, addressReadA # load in avalue
ldr bvalue, addressReadB # load in bvalue
ldr cvalue, addressReadC # load in cvalue
DISCRIM() prog3.s
avalue .reg r0
bvalue .req r1
cvalue .req r2
final .req r3
discrim:
mul bvalue, bvalue, bvalue # square bvalue
mul avalue, avalue, #4 # multiply avalue by 4
mul cvalue, avalue, cvalue # multiply avalue by cvalue
add final, bvalue, cvalue # calculated discriminant

Going with the calling convention that C compilers use is not a bad idea, esp since if you go from pure assembly programs to C and asm mixed, you already have that experience. And/or you may see the simplicity and wisdom in the calling conventions used.
How do you know what the calling convention for a compiler is? 1) read the manual/documentation and google. 2) just try it. Prototype a function that is similar in the number of operands the type of operands and return value and feed it real-ish numbers and see what it produces.
Compiling to asm sometimes works but with pseudo instructions and other things done by the assembler I prefer to dissemble than to compile to asm YMMV.
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c );
unsigned int test ( void )
{
return(fun(1,2,3));
}
which with gnu currently results in
00000000 <test>:
0: e92d4010 push {r4, lr}
4: e3a02003 mov r2, #3
8: e3a01002 mov r1, #2
c: e3a00001 mov r0, #1
10: ebfffffe bl 0 <fun>
14: e8bd4010 pop {r4, lr}
18: e12fff1e bx lr
Each combination of compiler and target may have a different calling convention, there is no reason to assume that different compilers or versions of the same compiler use the same convention. ARM, MIPS, and no doubt others try to help/encourage/suggest a calling convention to use and some compilers simply follow that, why not.
There are lots of exceptions to the rule in the convention, but for ARM for the first up to four registers worth of parameters, in this case for up to four signed or unsigned integers or up to four less than or equal to 32 bit quantities (float can create exceptions) the first four general purposes regisers are used r0 for the first parameter r1 for the second and so on. And currently the standard keeps the stack aligned on 64 bit boundaries.
So we see that the first parameter is indeed placed in r0 the second in r1 and third in r2, obviously you dont have to arrange those three instructions in that order, doesnt matter.
because this function is calling another function it has to preserve its return value in lr so that goes on the stack, because the standard says to keep the stack aligned on 64 bit boundaries they are pushing another register on the stack r4 is arbitrary it could be any register, this is the one the tool chose.
because the standard says to return in r0, code that implements one of these functions.
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c )
{
return(a+b^c);
}
00000000 <fun>:
0: e0800001 add r0, r0, r1
4: e0200002 eor r0, r0, r2
8: e12fff1e bx lr
it is very interesting now that I see this that the compiler did not do a tail optimization on the call, it could have not saved lr and did a branch to fun, since the return value in r0 is what test() was also returning in the same register. really kind of baffled that that didnt happen.
but you can see that indeed the return value is left in r0, and per the convention we can trash r0-r3 we dont have to preserve them, and these functions are not.
if you change test to this
unsigned int fun ( unsigned int a, unsigned int b, unsigned int c );
unsigned int test ( void )
{
return(fun(1,2,3)+7);
}
then it cant tail optimize and also shows the return register so you dont have to create a fun() function to see it.
00000000 <test>:
0: e92d4010 push {r4, lr}
4: e3a02003 mov r2, #3
8: e3a01002 mov r1, #2
c: e3a00001 mov r0, #1
10: ebfffffe bl 0 <fun>
14: e8bd4010 pop {r4, lr}
18: e2800007 add r0, r0, #7
1c: e12fff1e bx lr
you can do this kind of thing with other targets or other compilers, and there is no reason to assume that one target has the same convention as another.
Disassembly of section .text:
00000000 <fun>:
0: 0f 5e add r14, r15
2: 0f ed xor r13, r15
4: 30 41 ret
0000000000000000 <fun>:
0: 8d 04 37 lea (%rdi,%rsi,1),%eax
3: 31 d0 xor %edx,%eax
5: c3 retq
and this one is stack based instead of register based
Disassembly of section .text:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d41 0004 mov 4(r5), r1
8: 6d41 0006 add 6(r5), r1
c: 1d40 0008 mov 10(r5), r0
10: 7840 xor r1, r0
12: 1585 mov (sp)+, r5
14: 0087 rts pc
But if this is just a pure assembly project and you dont have to interface with compiled output, do whatever you want, part of designing the project is not just each individual function but how they interact, no different than C or Python or some other language you have to still define the interface for yourself between functions. Assembly doesnt make that special or different, just another language.

Would Thumb-2 ARM-Core Micros From Different Manufacturers Have Same Codesize?

Comparing two Thumb-2 micros from two different manufacturers. One's a Cortex M3, one's an A5. Are they guaranteed to compile a particular piece of code to the same codesize?

so here goes
fun.c
unsigned int fun ( unsigned int x )
{
return(x);
}
addimm.c
extern unsigned int fun ( unsigned int );
unsigned int addimm ( unsigned int x )
{
return(fun(x)+0x123);
}
for demonstration purposes building for bare metal, not really a functional program but it compiles clean and demonstrates what I intend to demonstrate.
arm instructions
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mcpu=cortex-a5 -march=armv7-a -c addimm.c -o addimma.o
disassembly of the object, not linked
00000000 <addimm>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <fun>
8: e2800e12 add r0, r0, #288 ; 0x120
c: e2800003 add r0, r0, #3
10: e8bd8008 pop {r3, pc}
thumb generic (armv4 or v5 whatever the default was for this compiler build)
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -c addimm.c -o addimmt.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: 3024 adds r0, #36 ; 0x24
8: 30ff adds r0, #255 ; 0xff
a: bc08 pop {r3}
c: bc02 pop {r1}
e: 4708 bx r1
cortex-a5 specific
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-a5 -march=armv7-a -c addimm.c -o addimma5.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: f200 1023 addw r0, r0, #291 ; 0x123
a: bd08 pop {r3, pc}
cortex-a5 is armv7-a which supports thumb-2 as far as the add immediate itself goes and related to binary size there is no optimization here, 32 bits for thumb and 32 bits for thumb2. But this is but one example there perhaps will be times that thumb2 produces smaller binaries than thumb.
cortex-m3
arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -mthumb -mcpu=cortex-m3 -march=armv7-m -c addimm.c -o addimmm3.o
00000000 <addimm>:
0: b508 push {r3, lr}
2: f7ff fffe bl 0 <fun>
6: f200 1023 addw r0, r0, #291 ; 0x123
a: bd08 pop {r3, pc}
produced the same result as cortex-a5. for this simple example the machine code for this object is the same, same size, when built for cortex-a5 and cortex-m3
Now if I add a bootstrap, a main, and call this function and fill in the function it calls to create a complete, linked, program
00000000 <_start>:
0: f000 f802 bl 8 <notmain>
4: e7fe b.n 4 <_start+0x4>
...
00000008 <notmain>:
8: 2005 movs r0, #5
a: f000 b801 b.w 10 <addimm>
e: bf00 nop
00000010 <addimm>:
10: b508 push {r3, lr}
12: f000 f803 bl 1c <fun>
16: f200 1023 addw r0, r0, #291 ; 0x123
1a: bd08 pop {r3, pc}
0000001c <fun>:
1c: 4770 bx lr
1e: 46c0 nop ; (mov r8, r8)
We get a result. The addimm function itself did not change in size. with a cortex-a5 you have to have some arm code that then switches to thumb, and likely when linking with libraries, etc you may get a mixture of arm and thumb, so
00000000 <_start>:
0: eb000000 bl 8 <notmain>
4: eafffffe b 4 <_start+0x4>
00000008 <notmain>:
8: e92d4008 push {r3, lr}
c: e3a00005 mov r0, #5
10: fa000001 blx 1c <addimm>
14: e8bd4008 pop {r3, lr}
18: e12fff1e bx lr
0000001c <addimm>:
1c: b508 push {r3, lr}
1e: f000 e804 blx 28 <fun>
22: f200 1023 addw r0, r0, #291 ; 0x123
26: bd08 pop {r3, pc}
00000028 <fun>:
28: e12fff1e bx lr
overall larger binary, the addimm part itself did not change in size though.
as far as linking changing the size of the object, look at this example
bootstrap.s
.thumb
.thumb_func
.globl _start
_start:
bl notmain
hang: b hang
.thumb_func
.globl dummy
dummy:
bx lr
.code 32
.globl bounce
bounce:
bx lr
hello.c
void dummy ( void );
void bounce ( void );
void notmain ( void )
{
dummy();
bounce();
}
looking at an arm build of notmain by itself, the object:
00000000 <notmain>:
0: e92d4800 push {fp, lr}
4: e28db004 add fp, sp, #4
8: ebfffffe bl 0 <dummy>
c: ebfffffe bl 0 <bounce>
10: e24bd004 sub sp, fp, #4
14: e8bd4800 pop {fp, lr}
18: e12fff1e bx lr
depending on what is calling it and what it calls, the linker may have to add more code to deal with items that are defined outside the object, from global variables to external functions
00008000 <_start>:
8000: f000 f818 bl 8034 <__notmain_from_thumb>
00008004 <hang>:
8004: e7fe b.n 8004 <hang>
00008006 <dummy>:
8006: 4770 bx lr
00008008 <bounce>:
8008: e12fff1e bx lr
0000800c <notmain>:
800c: e92d4800 push {fp, lr}
8010: e28db004 add fp, sp, #4
8014: eb000003 bl 8028 <__dummy_from_arm>
8018: ebfffffa bl 8008 <bounce>
801c: e24bd004 sub sp, fp, #4
8020: e8bd4800 pop {fp, lr}
8024: e12fff1e bx lr
00008028 <__dummy_from_arm>:
8028: e59fc000 ldr ip, [pc] ; 8030 <__dummy_from_arm+0x8>
802c: e12fff1c bx ip
8030: 00008007 andeq r8, r0, r7
00008034 <__notmain_from_thumb>:
8034: 4778 bx pc
8036: 46c0 nop ; (mov r8, r8)
8038: eafffff3 b 800c <notmain>
803c: 00000000 andeq r0, r0, r0
dummy_from_arm and notmain_from_thumb were both added, an increase in the size of the binary. each object did not change in size but the overall binary did. bounce() was an arm to arm function, no patching, dummy() arm to thumb and notmain() thumb to main.
so you might have a cortex-m3 object, and a cortex-a5 object that as far as the code in that object goes they are both identical. But dopending on what you link them with, which eventually something is dfferent between a cortex-m3 system and a cortex-a5 system, you may see more or less code added by the linker to account for the system differences, libraries, operating system specific, etc even so much as where in the binary you put the object, if it has to have a further reach than it can with a single instruction, then the linker will add even more code.
This is all gcc specific stuff, each toolchain is going to deal with each of these problems in its own way. It is the nature of the beast when you use an object and linker model, a very good model but the compiler, assembler, and linker have to work together to make sure that global resources can be properly accessed when linked. has nothing to do with ARM, this problem exists with many/most processor architectures and the toolchains deal with those problems per toolchain, per version, per target architecture. When I said change the size of the object what I really meant was the linker may add more code to the final binary in order to deal with that object and how it interacts with others.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Are C compilers like gcc smart enough to bit shifts in place? - c

Related

In the ARM ABI, how are global variables accessed?

What is 'veneer' that arm linker uses in function call?

Difference between position dependent and position independent code?

ARM Parameter Passing

Would Thumb-2 ARM-Core Micros From Different Manufacturers Have Same Codesize?

Categories

Resources