I understand that current gcc compilers generate position independent code by default. However, to get an understanding of what position dependent code looks like, I compiled this:
int Add(int x, int y) {
return x+y;
}
int Subtract(int x, int y) {
return x-y;
}
int main() {
bool flag = false;
int x=10,y=5,z;
if (flag) {
z = Add(x,y);
}
else {
z = Subtract(x,y);
}
}
with g++ -c check.cpp -no-pie. However, the generated code is identical with or without the -no-pie flag. <main+0x34> looks to be a relative offset.
26: 55 push %rbp
27: 48 89 e5 mov %rsp,%rbp
2a: 48 83 ec 10 sub $0x10,%rsp
2e: c6 45 f3 00 movb $0x0,-0xd(%rbp)
32: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
39: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
40: 80 7d f3 00 cmpb $0x0,-0xd(%rbp)
44: 74 14 je 5a <main+0x34>
46: 8b 55 f8 mov -0x8(%rbp),%edx
49: 8b 45 f4 mov -0xc(%rbp),%eax
4c: 89 d6 mov %edx,%esi
4e: 89 c7 mov %eax,%edi
50: e8 00 00 00 00 callq 55 <main+0x2f>
55: 89 45 fc mov %eax,-0x4(%rbp)
58: eb 12 jmp 6c <main+0x46>
5a: 8b 55 f8 mov -0x8(%rbp),%edx
5d: 8b 45 f4 mov -0xc(%rbp),%eax
60: 89 d6 mov %edx,%esi
62: 89 c7 mov %eax,%edi
64: e8 00 00 00 00 callq 69 <main+0x43>
69: 89 45 fc mov %eax,-0x4(%rbp)
6c: b8 00 00 00 00 mov $0x0,%eax
71: c9 leaveq
72: c3 retq
This is the objdump output in both cases, for main only. Am I using the wrong flag, or is the assembly supposed to be the same for PIC and non-PIC for this code? If it is supposed to be the same, could you please provide a snippet for which it isn't?
You have to access items that are outside the module or section to see a difference.
unsigned int x;
void fun ( void )
{
x = 5;
}
So this reaches from .text across to a data section (.bss in this case, since x is uninitialized).
Position dependent:
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000
Position independent:
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f3010 ldr r3, [pc, #16] ; 1c <fun+0x1c>
8: e59f1010 ldr r1, [pc, #16] ; 20 <fun+0x20>
c: e08f3003 add r3, pc, r3
10: e7933001 ldr r3, [r3, r1]
14: e5832000 str r2, [r3]
18: e12fff1e bx lr
1c: 00000008
20: 00000000
In the first case the linker will fill in the address to the memory location
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000 <--- here
The pc-relative load from 4: to 10: stays within the .text section, so it works whether the code is position dependent or independent.
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000
It gets the address of the external entity, filled in by the linker, and then directly accesses the item at that address.
4: e59f3010 ldr r3, [pc, #16] ; 1c <fun+0x1c>
8: e59f1010 ldr r1, [pc, #16] ; 20 <fun+0x20>
c: e08f3003 add r3, pc, r3
10: e7933001 ldr r3, [r3, r1]
14: e5832000 str r2, [r3]
18: e12fff1e bx lr
1c: 00000008
20: 00000000
The position independent version is easier to follow once linked (-Ttext=0x1000 -Tdata=0x2000):
00001000 <fun>:
1000: e3a02005 mov r2, #5
1004: e59f3010 ldr r3, [pc, #16] ; 101c <fun+0x1c>
1008: e59f1010 ldr r1, [pc, #16] ; 1020 <fun+0x20>
100c: e08f3003 add r3, pc, r3
1010: e7933001 ldr r3, [r3, r1]
1014: e5832000 str r2, [r3]
1018: e12fff1e bx lr
101c: 00010010
1020: 0000000c
Disassembly of section .got:
00011024 <_GLOBAL_OFFSET_TABLE_>:
...
11030: 00002000
Disassembly of section .bss:
00002000 <x>:
2000: 00000000
(clearly I should have also specified where the GOT goes).
While the global offset table and .bss are different sections, once linked they are fixed relative to each other. What position independence gives you is the ability to move .bss (or .data, etc.) relative to .text. Think about the position dependent solution: if .data were to move and you had, say, 1000 references sprinkled all through the binary, then in order to move it you would have to patch every one of those.
Instead, the global offset table here provides a single location where the address of the variable x lives, and every access to x essentially uses double indirection. It may not be obvious, but a position dependent way to get at a table like this would be for the linker to fill in its absolute address. That would not be independent, and this was compiled to be independent, so pc-relative math has to be done to find the global offset table. For this instruction set, when executing the instruction at 0x100c the program counter reads as 0x100c+8.
100c: e08f3003 add r3, pc, r3
So we compute 0x100c+8+0x00010010 = 0x11024, which is the address of the GOT, and then add 0x0000000c to that, giving 0x11030, the slot within it. The word stored in that slot, 0x2000, is the address of the item. So you do the second indirection there to get at the item itself.
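The arithmetic above can be checked with a few lines of C. This is only a sketch: the function name and its parameters are made up for illustration, and the only rule it encodes is the ARM-mode one already stated, that reading pc yields the instruction's address plus 8.

```c
#include <stdint.h>

/* Hypothetical helper: address of the GOT slot reached by the
   ldr/ldr/add/ldr sequence above, in ARM (not Thumb) mode. */
uint32_t arm_got_slot(uint32_t add_insn_addr, /* address of "add r3, pc, r3" */
                      uint32_t got_rel,       /* literal loaded into r3 */
                      uint32_t slot_off)      /* literal loaded into r1 */
{
    uint32_t pc = add_insn_addr + 8;   /* ARM mode: pc reads 8 bytes ahead */
    uint32_t got_base = pc + got_rel;  /* 0x1014 + 0x10010 = 0x11024 */
    return got_base + slot_off;        /* 0x11024 + 0xc = 0x11030 */
}
```

With the numbers from the linked listing, arm_got_slot(0x100c, 0x00010010, 0x0000000c) comes out as 0x11030, the slot that holds 0x2000.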
If you were to place .text at an address other than 0x1000 but not move .bss, that is fine; this will all work so long as the GOT moves to the same relative offset from .text. If you were to leave .text alone but move .bss, then you have to update the GOT. If you move .bss from 0x2000 to 0x3000, that is a difference of +0x1000, so you go through the GOT and add 0x1000 to each item to cover that difference.
Position independence essentially has to do double indirection instead of single indirection (or one more level than would have been needed for position dependent) in order to access distant items, or items whose position is not fixed relative to .text. That means more code and more memory accesses, so it is both larger and slower.
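In C terms the difference looks roughly like the sketch below. The table here is an ordinary array standing in for the GOT, and all names are invented for illustration; a real compiler generates the table and the lookups for you.

```c
static unsigned int x;

/* Stand-in for the GOT: a table of addresses, one slot per global. */
static unsigned int *address_table[] = { &x };

/* Position dependent flavor: the address of x is used directly. */
void store_direct(void)
{
    x = 5;
}

/* Position independent flavor: fetch the address from the table
   (first indirection), then store through it (second indirection). */
void store_via_table(void)
{
    unsigned int *p = address_table[0];
    *p = 5;
}
```

If x moves, only the one slot in address_table has to change, not every function that touches x. The price is the extra load in store_via_table.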
For it to work, .text reaching out to other .text items can't use fixed addresses; it has to use indirect/computed addresses. Likewise the GOT as used here (by GNU) has to be at a fixed relative position to .text. Then from there you can move data relative to code and still access it. So you have to have some rules. .text, being code and assumed read only, can't hold this offset table, which needs to be in ram, so it can't simply be built into the .text section.
I am writing a simple multitasking OS for the ARM Cortex M3. My threads always run using the Process Stack Pointer. I have an application that I inherited and that uses global variables. I am trying to call the functions in that application from my threading code but it is not accessing memory correctly. Are the following statements correct:
Those global variables are accessed via some kind of relative addressing, and that relative address is placed on the Main stack (using MSP)?
My threading code, using PSP, will never be able to access them
I need to switch to MSP when calling these functions, then back to PSP when using my threads?
EDIT: Clarified that this is for a Cortex-M.
Global variables have nothing to do with the stack, and neither do static locals.
So you just need to look at the output of the compiler; it will tell you everything.
Your question is very vague; you could be asking one of many different questions. I will show some basics and maybe I will get lucky.
Note that this should in general have nothing to do with the processor, mode, etc. (arm, thumb, x86, whatever). It has much more to do with the toolchain.
If this is too basic and you are asking some very advanced question that is not obvious to me, I will delete or rewrite, no problem.
Throwaway code is always a good idea to figure things out.
flash.s
.thumb
.syntax unified
.word 0x20001000
.word reset
.thumb_func
reset:
bl notmain
b .
notmain.c
unsigned int x;
unsigned int y=5;
void notmain ( void )
{
static unsigned int z=7;
x=++y;
z--;
}
flash.ld
MEMORY
{
rom : ORIGIN = 0x00080000, LENGTH = 0x00001000
ram : ORIGIN = 0x20000000, LENGTH = 0x00001000
}
SECTIONS
{
.text : { *(.text) } > rom
.bss : { *(.bss) } > ram
.data : { *(.data) } > ram
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m0 flash.s -o flash.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -c notmain.c -o notmain.o
arm-none-eabi-ld -nostdlib -nostartfiles -T flash.ld flash.o notmain.o -o flash.elf
arm-none-eabi-objdump -D flash.elf > flash.list
arm-none-eabi-objcopy -O binary flash.elf flash.bin
examine
Disassembly of section .text:
00080000 <reset-0x8>:
80000: 20001000 andcs r1, r0, r0
80004: 00080009 andeq r0, r8, r9
00080008 <reset>:
80008: f000 f802 bl 80010 <notmain>
8000c: e7fe b.n 8000c <reset+0x4>
...
00080010 <notmain>:
80010: 4b04 ldr r3, [pc, #16] ; (80024 <notmain+0x14>)
80012: 4905 ldr r1, [pc, #20] ; (80028 <notmain+0x18>)
80014: 681a ldr r2, [r3, #0]
80016: 3201 adds r2, #1
80018: 601a str r2, [r3, #0]
8001a: 600a str r2, [r1, #0]
8001c: 685a ldr r2, [r3, #4]
8001e: 3a01 subs r2, #1
80020: 605a str r2, [r3, #4]
80022: 4770 bx lr
80024: 20000004 andcs r0, r0, r4
80028: 20000000 andcs r0, r0, r0
Disassembly of section .bss:
20000000 <x>:
20000000: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000004 <y>:
20000004: 00000005 andeq r0, r0, r5
20000008 <z.3645>:
20000008: 00000007 andeq r0, r0, r7
This is basic, not relocatable, etc.
80010: 4b04 ldr r3, [pc, #16] ; (80024 <notmain+0x14>)
80014: 681a ldr r2, [r3, #0]
80016: 3201 adds r2, #1
80018: 601a str r2, [r3, #0]
80024: 20000004 andcs r0, r0, r4
Disassembly of section .data:
20000004 <y>:
20000004: 00000005 andeq r0, r0, r5
We can see the ++y: r3 gets the address of y, r2 gets the value of y, r2 is incremented, and then it is saved back to memory.
And you can see how x and z are handled as well.
Now this cannot work for an mcu for a couple of reasons. The data at the 0x20000000 addresses will not be there; only what is in non-volatile storage will be there when the chip powers up and comes out of reset. The above is relevant depending on what your real question is.
MEMORY
{
rom : ORIGIN = 0x00080000, LENGTH = 0x00001000
ram : ORIGIN = 0x20000000, LENGTH = 0x00001000
}
SECTIONS
{
.text : { *(.text) } > rom
.bss : { *(.bss) } > ram AT > rom
.data : { *(.data) } > ram AT > rom
}
The program does not change, but the binary does
00000000 00 10 00 20 09 00 08 00 00 f0 02 f8 fe e7 00 00 |... ............|
00000010 04 4b 05 49 1a 68 01 32 1a 60 0a 60 5a 68 01 3a |.K.I.h.2.`.`Zh.:|
00000020 5a 60 70 47 04 00 00 20 00 00 00 20 05 00 00 00 |Z`pG... ... ....|
00000030 07 00 00 00 |....|
00000034
At 0x2C we see the preload value for y and at 0x30 for z.
The .bss value is not located here. Normally what you do is add a whole lot more stuff to the linker script to get the addresses of things: data start and stop, and bss start and size or stop. Then you write a bootstrap that copies from flash to ram, so that the initialized values are in ram and reads and writes work.
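The copy step can be sketched as follows. It is written in C only for readability, and the function name and parameters are assumptions: in a real build the three values would come from symbols defined in the linker script, and the loop would typically live in the assembly bootstrap.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the .data copy a bootstrap performs before calling the C
   entry point: move the preload values from flash to their run-time
   home in ram, one word at a time. */
void copy_data(uint32_t *ram_dst, const uint32_t *flash_src, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        ram_dst[i] = flash_src[i];
    }
}
```

For the binary above, flash_src would point at the two words stored at offset 0x2C of the image (5 and 7), and ram_dst at 0x20000004.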
So if your project, call it an operating system or not, is just one large body of code that is compiled and linked all together, without special things like lots of sections, then the above is what you are looking at, and the stack is not related to globals. It normally never is.
(msp/psp do not work the way arm implies they do. I have yet to see a use case for the second stack pointer, IF the processor even has it; they do not all have it implemented.)
Now if your threads are actually separately built programs that you load at runtime, then they live completely in ram. So
MEMORY
{
rom : ORIGIN = 0x00080000, LENGTH = 0x00001000
ram : ORIGIN = 0x20000000, LENGTH = 0x00001000
}
SECTIONS
{
.text : { *(.text) } > ram
.bss : { *(.bss) } > ram
.data : { *(.data) } > ram
}
and we add -fPIC
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -fPIC -c notmain.c -o notmain.o
Disassembly of section .text:
20000000 <reset-0x8>:
20000000: 20001000 andcs r1, r0, r0
20000004: 20000009 andcs r0, r0, r9
20000008 <reset>:
20000008: f000 f802 bl 20000010 <notmain>
2000000c: e7fe b.n 2000000c <reset+0x4>
...
20000010 <notmain>:
20000010: 4a07 ldr r2, [pc, #28] ; (20000030 <notmain+0x20>)
20000012: 4b08 ldr r3, [pc, #32] ; (20000034 <notmain+0x24>)
20000014: 447a add r2, pc
20000016: 58d1 ldr r1, [r2, r3]
20000018: 680b ldr r3, [r1, #0]
2000001a: 3301 adds r3, #1
2000001c: 600b str r3, [r1, #0]
2000001e: 4906 ldr r1, [pc, #24] ; (20000038 <notmain+0x28>)
20000020: 5852 ldr r2, [r2, r1]
20000022: 6013 str r3, [r2, #0]
20000024: 4a05 ldr r2, [pc, #20] ; (2000003c <notmain+0x2c>)
20000026: 447a add r2, pc
20000028: 6813 ldr r3, [r2, #0]
2000002a: 3b01 subs r3, #1
2000002c: 6013 str r3, [r2, #0]
2000002e: 4770 bx lr
20000030: 00000034 andeq r0, r0, r4, lsr r0
20000034: 00000004 andeq r0, r0, r4
20000038: 00000000 andeq r0, r0, r0
2000003c: 0000001a andeq r0, r0, sl, lsl r0
Disassembly of section .bss:
20000040 <x>:
20000040: 00000000 andeq r0, r0, r0
Disassembly of section .data:
20000044 <z.3645>:
20000044: 00000007 andeq r0, r0, r7
20000048 <y>:
20000048: 00000005 andeq r0, r0, r5
Disassembly of section .got:
2000004c <.got>:
2000004c: 20000040 andcs r0, r0, r0, asr #32
20000050: 20000048 andcs r0, r0, r8, asr #32
Disassembly of section .got.plt:
20000054 <_GLOBAL_OFFSET_TABLE_>:
...
This is because you may need to be able to load the program anywhere in ram (within rules).
The code is all relative, but the data, because of the nature of compiling and linking, needs some hardcoded addresses. So the tools set up a global offset table (GOT). The location of the GOT is fixed relative to the code; you cannot change that.
20000010: 4a07 ldr r2, [pc, #28] ; (20000030 <notmain+0x20>)
20000012: 4b08 ldr r3, [pc, #32] ; (20000034 <notmain+0x24>)
20000014: 447a add r2, pc
20000016: 58d1 ldr r1, [r2, r3]
20000018: 680b ldr r3, [r1, #0]
2000001a: 3301 adds r3, #1
2000001c: 600b str r3, [r1, #0]
There is your ++y when built position independent.
r2 gets an offset and r3 gets another offset. r2 is the relative offset from the code to the GOT (you cannot separate them and move one around and not the other; that is not what position independent means), so after the add r2 points to the GOT. r3 is the offset within the GOT of the slot that holds the address of y. r1 gets the address of y, and from there it is like before: get y in r3, add one, save y to memory.
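The pc-relative step can be checked against the listing with a small C sketch. The helper name is invented; the one rule it relies on is that in Thumb state the pc used by "add rN, pc" reads as the instruction's address plus 4.

```c
#include <stdint.h>

/* Hypothetical helper: result of Thumb "add rN, pc" when rN held
   'offset' and the instruction sits at 'insn_addr'. */
uint32_t thumb_add_pc(uint32_t insn_addr, uint32_t offset)
{
    return insn_addr + 4 + offset;
}
```

thumb_add_pc(0x20000014, 0x34) gives 0x2000004c, the start of the GOT; adding the 4 in r3 selects the slot at 0x20000050, which holds 0x20000048, the address of y. The static local z skips the GOT entirely: thumb_add_pc(0x20000026, 0x1a) gives 0x20000044, which is z itself.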
Now IF you were to relocate this to an address that is not 0x20000000, your bootstrap needs to go to the GOT and patch up all the addresses, so you need linker magic to find where the GOT is and how big it is, etc. Use the pc to figure out where you are and then make the adjustments. If loaded into memory at 0x20002000 then you need to add 0x2000 to each of the entries in the table and then it will all just work. (Still no stack stuff; the stack is not related.)
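A sketch of that patch-up loop, in C for readability. The function name and parameters are assumptions; a real bootstrap would take the table location and size from linker-provided symbols and work out the load address from the pc.

```c
#include <stddef.h>
#include <stdint.h>

/* Add the difference between where the image was linked and where it
   was actually loaded to every entry of the GOT. */
void got_fixup(uint32_t *got, size_t entries,
               uint32_t link_base, uint32_t load_base)
{
    uint32_t delta = load_base - link_base; /* unsigned wrap also covers loading lower */
    for (size_t i = 0; i < entries; i++) {
        got[i] += delta;
    }
}
```

With the two entries from the listing (0x20000040 and 0x20000048), linked at 0x20000000 and loaded at 0x20002000, the table becomes 0x20002040 and 0x20002048.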
A little trick if you have the space.
Notice I put .bss before .data, and I have at least one .data item. If you can guarantee that (force a .data item in your bootstrap, for example), the binary carries the zeros for .bss:
00000000 00 10 00 20 09 00 00 20 00 f0 02 f8 fe e7 00 00 |... ... ........|
00000010 07 4a 08 4b 7a 44 d1 58 0b 68 01 33 0b 60 06 49 |.J.KzD.X.h.3.`.I|
00000020 52 58 13 60 05 4a 7a 44 13 68 01 3b 13 60 70 47 |RX.`.JzD.h.;.`pG|
00000030 34 00 00 00 04 00 00 00 00 00 00 00 1a 00 00 00 |4...............|
00000040 00 00 00 00 07 00 00 00 05 00 00 00 40 00 00 20 |............@.. |
00000050 48 00 00 20 00 00 00 00 00 00 00 00 00 00 00 00 |H.. ............|
00000060
20000040 <x>:
20000040: 00000000 andeq r0, r0, r0
Objcopy pads the binary for a -O binary with zeros for .bss. If you put .bss last, you cannot assume that happens.
So I do not know how this code you have uses threads and globals. Does it try to keep variables specific to each thread? If so, does it use static locals up front and then pass the address on the stack? (Even there, the stack pointer you use does not matter unless you are not using the stack properly in general, and if that is the case then globals are not your problem.)
If you start off the thread, or any code, on one stack pointer (implying completely separate stacks, that is, separate memory regions) and then switch, abandoning the stack information the code needs to get in and out of functions, then as soon as you return from functions after switching stacks all of the code would break, not just pointers to static locals that are passed along.
So a minimal example that demonstrates the problem can confirm for us what is really going on and what your questions really are and what the problem is. If you want to use the two stack pointers for a cortex-m you need to carefully read up and you need to also write some throwaway code examples to see how it works, and then apply that to the code the tools are generating.
Again if this is too elementary and I am miles away from the real question, I will certainly delete this no problem.
Unlike assembly code, in C there is no way to bit shift a value in place. To shift the bits in a variable, an assignment must always be performed:
x = x << 3;
Are compilers like gcc smart enough to realize that this is an in-place bit shift and compile it like this:
shl x, 3
or will the compiler put the result first in a register, then move it back into x (which would require two extra unnecessary instructions)?
Any good compiler with optimization turned on will handle bit shifts efficiently.
Compilers will keep small objects in registers when feasible and efficient and will not store them to memory even if you write assignment statements, until they are forced to by circumstances.
Additionally, it is not desirable on typical modern processors to try to shift the bits of a value in memory. Generally, memory hardware does not have any capability to manipulate stored values. To change the value of something in memory, it must be moved to the processor (loaded), changed, and moved back (stored). Whether this is done in one instruction or several is not generally an indication of how fast or efficient it is, because the processor still has to do the individual load, shift, store operations, and the performance of those is highly dependent on the processor model.
Except in exceptional programming situations, you should not be worrying about performance at this level.
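For what it is worth, C does have a compound form, x <<= 3, which means the same thing as x = x << 3. A minimal sketch (the function names are only for illustration) showing the two spellings side by side; since they describe the same operation, a compiler is free to emit the same instructions for both:

```c
/* Plain assignment form. */
unsigned int shift_assign(unsigned int x)
{
    x = x << 3;
    return x;
}

/* Compound assignment form; same meaning, x is named only once. */
unsigned int shift_compound(unsigned int x)
{
    x <<= 3;
    return x;
}
```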
What did you see when you tried it? Why not just try it?
unsigned int fun ( unsigned int x )
{
return (x<<3);
}
Disassembly of section .text:
00000000 <fun>:
0: e1a00180 lsl r0, r0, #3
4: e12fff1e bx lr
Disassembly of section .text:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 0cc0 asl r0
a: 0cc0 asl r0
c: 0cc0 asl r0
e: 1585 mov (sp)+, r5
10: 0087 rts pc
Disassembly of section .text:
0000000000000000 <fun>:
0: 531d7000 lsl w0, w0, #3
4: d65f03c0 ret
Disassembly of section .text:
0000000000000000 <fun>:
0: 8d 04 fd 00 00 00 00 lea 0x0(,%rdi,8),%eax
7: c3 retq
00000000 <fun>:
0: 42 18 0c 5c rpt #3 { rlax.w r12 ;
4: 30 41 ret
Disassembly of section .text:
00000000 <fun>:
0: 050e slli x10,x10,0x3
2: 8082 ret
unsigned int x;
void fun ( void )
{
x=x<<3;
}
Disassembly of section .text:
00000000 <fun>:
0: e59f200c ldr r2, [pc, #12] ; 14 <fun+0x14>
4: e5923000 ldr r3, [r2]
8: e1a03183 lsl r3, r3, #3
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
and so on
I was going through some bare metal programming tutorials. While reading about C code execution I learned that we need to set up the C execution environment: initializing the stack, zeroing bss, etc.
In some cases you have to copy data into RAM, and need to provide startup code for that as well. Link of tutorial which says copy data in RAM.
Now I have two doubts.
If we need to copy data into RAM, then why don't we copy the code, i.e. the text segment? If we don't copy the text segment, does it mean the code is executed from the SD card itself in the case of the Raspberry Pi 3 (ARM embedded processor)?
When we specify a linker script like the one below, does it mean those sections are copied into RAM, or that they will be mapped at RAM addresses?
Sorry, I am really confused.
MEMORY
{
ram : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > ram
}
Any help is appreciated.
vectors.s
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
notmain.c
unsigned int x;
unsigned int y=0x12345678;
void notmain ( void )
{
x=y+7;
}
memmap
MEMORY
{
bob : ORIGIN = 0x80000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
.data : { *(.data*) } > ted
}
build
arm-none-eabi-as --warn --fatal-warnings vectors.s -o vectors.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -c notmain.c -o notmain.o
arm-none-eabi-ld vectors.o notmain.o -T memmap -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy notmain.elf -O binary kernel.img
You can add/remove options, and name it the right kernelX.img (and if you are venturing into 64 bit then use aarch64-whatever-gcc instead of arm-whatever-gcc).
Looking at the disassembly
Disassembly of section .text:
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000000 bl 800c <notmain>
8008: eafffffe b 8008 <_start+0x8>
0000800c <notmain>:
800c: e59f3010 ldr r3, [pc, #16] ; 8024 <notmain+0x18>
8010: e5933000 ldr r3, [r3]
8014: e59f200c ldr r2, [pc, #12] ; 8028 <notmain+0x1c>
8018: e2833007 add r3, r3, #7
801c: e5823000 str r3, [r2]
8020: e12fff1e bx lr
8024: 00008030 andeq r8, r0, r0, lsr r0
8028: 0000802c andeq r8, r0, r12, lsr #32
Disassembly of section .bss:
0000802c <x>:
802c: 00000000 andeq r0, r0, r0
Disassembly of section .data:
00008030 <y>:
8030: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
and comparing that to the kernelX.img file
hexdump -C kernel.img
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 30 80 00 00 2c 80 00 00 00 00 00 00 |../.0...,.......|
00000030 78 56 34 12 |xV4.|
00000034
Note that because I put .data after .bss in the linker script, they appear in that order in the image. There are four bytes of zeros between the last word of .text and the 0x12345678 of .data.
If you swap the positions of .bss and .data in the linker script:
0000802c <y>:
802c: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
Disassembly of section .bss:
00008030 <x>:
8030: 00000000 andeq r0, r0, r0
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 2c 80 00 00 30 80 00 00 78 56 34 12 |../.,...0...xV4.|
00000030
Oops, no freebie. Now .bss is not zeroed and you would need to zero it in your bootstrap (if you have a .bss area and, as a programming style, you assume those items are zero when you first use them).
Okay, so how do you find where .bss is? Well, that is what the tutorial and countless others are showing you.
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
linker_stuff:
.word hello_world
.word world_hello
MEMORY
{
bob : ORIGIN = 0x80000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.data : { *(.data*) } > ted
hello_world = .;
.bss : { *(.bss*) } > ted
world_hello = .;
}
build and disassemble
Disassembly of section .text:
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000002 bl 8014 <notmain>
8008: eafffffe b 8008 <_start+0x8>
0000800c <linker_stuff>:
800c: 00008038 andeq r8, r0, r8, lsr r0
8010: 0000803c andeq r8, r0, r12, lsr r0
00008014 <notmain>:
8014: e59f3010 ldr r3, [pc, #16] ; 802c <notmain+0x18>
8018: e5933000 ldr r3, [r3]
801c: e59f200c ldr r2, [pc, #12] ; 8030 <notmain+0x1c>
8020: e2833007 add r3, r3, #7
8024: e5823000 str r3, [r2]
8028: e12fff1e bx lr
802c: 00008034 andeq r8, r0, r4, lsr r0
8030: 00008038 andeq r8, r0, r8, lsr r0
Disassembly of section .data:
00008034 <y>:
8034: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
Disassembly of section .bss:
00008038 <x>:
8038: 00000000 andeq r0, r0, r0
So, digging more into toolchain specific stuff, we now know the start and end of .bss, or we can use math in the linker script to get the size. From that you can write a small loop that zeros that memory (in assembly language of course, chicken and egg, in the bootstrap before you branch to the C entry point of your program).
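The loop itself is tiny. Here it is sketched in C purely to show the logic; as noted, the real one belongs in the assembly bootstrap. The function name is made up, and start and end play the roles of hello_world and world_hello.

```c
#include <stdint.h>

/* Zero every word from 'start' up to, but not including, 'end'. */
void zero_bss(uint32_t *start, uint32_t *end)
{
    while (start < end) {
        *start++ = 0;
    }
}
```

In the listing above the two words in linker_stuff are 0x8038 and 0x803c, so the loop would clear exactly one word, x.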
Now say for some reason you wanted .data at some other address 0x10000000
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
MEMORY
{
bob : ORIGIN = 0x10000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
.data : { *(.data*) } > bob
}
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000000 bl 800c <notmain>
8008: eafffffe b 8008 <_start+0x8>
0000800c <notmain>:
800c: e59f3010 ldr r3, [pc, #16] ; 8024 <notmain+0x18>
8010: e5933000 ldr r3, [r3]
8014: e59f200c ldr r2, [pc, #12] ; 8028 <notmain+0x1c>
8018: e2833007 add r3, r3, #7
801c: e5823000 str r3, [r2]
8020: e12fff1e bx lr
8024: 10000000 andne r0, r0, r0
8028: 0000802c andeq r8, r0, r12, lsr #32
Disassembly of section .bss:
0000802c <x>:
802c: 00000000 andeq r0, r0, r0
Disassembly of section .data:
10000000 <y>:
10000000: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
So what is the kernel.img or -O binary format? It is just a memory image starting at the lowest address (0x8000 in this case) and filled OR PADDED to the highest address, in this case 0x10000003, so it is a 0x10000004-0x8000 = 0x0fff8004 byte file.
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 00 00 00 10 2c 80 00 00 00 00 00 00 |../.....,.......|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0fff8000 78 56 34 12 |xV4.|
0fff8004
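The file size follows directly from the two addresses. A one-line sketch in C, with an invented function name, just to make the subtraction explicit:

```c
#include <stdint.h>

/* Size of a flat "-O binary" image: everything from the lowest loaded
   address up to the end of the highest loaded section, gaps included. */
uint32_t flat_image_size(uint32_t lowest_addr, uint32_t highest_end)
{
    return highest_end - lowest_addr;
}
```

flat_image_size(0x8000, 0x10000004) is 0x0fff8004, matching the final offset in the hexdump, which is just under 256 MiB for a program with 0x34 bytes of real content.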
That is a massive waste of disk space for this program, they padded the hell out of that. Now if for some reason you wanted to do something like this, various reasons (that generally do not apply to bare metal on the pi), you could do this instead:
MEMORY
{
bob : ORIGIN = 0x10000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
.data : { *(.data*) } > bob AT > ted
}
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 00 00 00 10 2c 80 00 00 00 00 00 00 |../.....,.......|
00000030 78 56 34 12 |xV4.|
00000034
Disassembly of section .bss:
0000802c <x>:
802c: 00000000 andeq r0, r0, r0
Disassembly of section .data:
10000000 <y>:
10000000: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
What it has done is this: the code is compiled and linked for .data at 0x10000000, but the binary that you carry around and load has the .data contents bundled up tight. It is the job of the bootstrap to copy that data to its correct landing spot of 0x10000000, and again you have to use toolchain specific linker script features to find it:
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
linker_stuff:
.word data_start
.word data_end
MEMORY
{
bob : ORIGIN = 0x10000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
data_start = .;
.data : { *(.data*) } > bob AT > ted
data_end = .;
}
0000800c <linker_stuff>:
800c: 00008038 andeq r8, r0, r8, lsr r0
8010: 10000004 andne r0, r0, r4
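And clearly that didn't quite work (data_start came out as 0x8038, an address in ted, while data_end is 0x10000004, an address in bob), so you have to do more linker script work to figure it out.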
There is no good reason to need any of this for the raspberry pi. At best, if you have .bss and don't have any .data, and/or you put .bss last and have a lot of it, then you can either take advantage of the toolchain accidentally zero padding and solving the .bss problem for you, or, if that makes too big a binary, you can see above how to find the .bss offset and size and then add the few lines of code to zero it (ultimately costing load time either way, but not costing sd card space).
Where you definitely need to learn such things is on a microcontroller, where the non-volatile storage is treated as read-only flash. If you choose to program in a style that requires .data and/or .bss, and you assume those items are implemented, then you have to do the toolchain specific work to link and then zero and/or copy from non-volatile flash to read/write ram before branching into the first or only C entry point of your application.
I am sure someone could come up with reasons not to pack a pi bare metal binary up nice and neat; there is always an exception. But for now you don't need to worry about those exceptions: put .bss first, then .data, and always make sure you have a .data item even if it is something you never use.
I have C code that calls a function defined in ARM assembly. Two parameters have to be passed.
If the function call looks like this:
functionName(a, b)
in which order do the registers x0 and x1 hold these values? Does x0 hold a and x1 hold b, or the other way round?
It took longer to ask the question than to just try it.
extern void bar ( unsigned int, unsigned int );
void foo ( void )
{
bar(5,7);
}
compile then disassemble
traditional arm
00000000 <foo>:
0: e3a00005 mov r0, #5
4: e3a01007 mov r1, #7
8: eafffffe b 0 <bar>
aarch64
0000000000000000 <foo>:
0: 528000e1 mov w1, #0x7 // #7
4: 528000a0 mov w0, #0x5 // #5
8: 14000000 b 0 <bar>
c: d503201f nop
msp430
00000000 <foo>:
0: 3e 40 07 00 mov #7, r14 ;#0x0007
4: 3f 40 05 00 mov #5, r15 ;#0x0005
8: b0 12 00 00 call #0x0000
c: 30 41 ret
pdp-11
00000000 <_foo>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 15e6 0007 mov $7, -(sp)
8: 15e6 0005 mov $5, -(sp)
c: 09f7 fff0 jsr pc, 0 <_foo>
10: 65c6 0004 add $4, sp
14: 1585 mov (sp)+, r5
16: 0087 rts pc
Is there a logical reason GCC (4.4.7) is not moving the byte from a structure into %eax directly, or is it just an optimization oversight?
Consider the following program:
struct foo { unsigned char x; };
struct bar { unsigned int x; };
int foo (const struct foo *x, int y) { return x->x * y; }
int bar (const struct bar *x, int y) { return x->x * y; }
When compiling with GCC, foo() and bar() differ more substantially than I expected:
foo:
.LFB0:
.cfi_startproc
movzbl (%rdi), %edx
movl %esi, %eax
imull %edx, %eax
ret
.cfi_endproc
bar:
.LFB1:
.cfi_startproc
movl (%rdi), %eax
imull %esi, %eax
ret
.cfi_endproc
I expected foo() would be just like bar(), except using a different move instruction.
I will note that under clang-500.2.79, the compiler generates the code I expect for foo(), and foo() and bar() have the same number of instructions (as I had expected for GCC as well, but was wrong).
Since you multiply an unsigned char x by an int y in the function foo, the compiler must promote x to int first, which is exactly what the movzbl instruction does.
See the explanation of movz instructions here.
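The promotion is easy to see from C itself. A minimal sketch with an invented function name: the unsigned char operand is widened to int before the multiply, so the product is not limited to 8 bits.

```c
/* The unsigned char operand is promoted to int before the multiply,
   so the product is computed at int width and does not wrap at 255. */
int mul_promoted(unsigned char c, int y)
{
    return c * y;
}
```

mul_promoted(200, 2) is 400, which would not fit in an unsigned char. Zero-extending the loaded byte to 32 bits, as movzbl does, is how that widening shows up in the generated code.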
Afterward I recompiled your code with gcc 4.6.1 and the -O3 option, and got the following assembly:
foo:
.LFB34:
.cfi_startproc
movzbl (%rdi), %eax
imull %esi, %eax
ret
.cfi_endproc
bar:
.LFB35:
.cfi_startproc
movl (%rdi), %eax
imull %esi, %eax
ret
.cfi_endproc
It doesn't use %edx any more.
The short answer
Why will GCC copy word into the return register but not byte?
Because you asked it to return a word not a byte. The compilers did what they were asked based on your code. You asked for a size promotion in one case and unsigned to signed in both cases. There was more than one way to do that and clang/llvm and gcc happened to vary.
Is there a logical reason GCC (4.4.7) is not moving the byte from a structure into %eax directly, or is it just an optimization oversight?
I think based on what we see in the current compilers it was an oversight. See generated code below. (-O2 used in each case).
Interesting experiments related to this question.
clang
0000000000000000 <foo>:
0: 0f b6 07 movzbl (%rdi),%eax
3: 0f af c6 imul %esi,%eax
6: c3 retq
0000000000000010 <bar>:
10: 0f af 37 imul (%rdi),%esi
13: 89 f0 mov %esi,%eax
15: c3 retq
gcc
0000000000000000 <foo>:
0: 0f b6 07 movzbl (%rdi),%eax
3: 0f af c6 imul %esi,%eax
6: c3 retq
0000000000000010 <bar>:
10: 8b 07 mov (%rdi),%eax
12: 0f af c6 imul %esi,%eax
15: c3 retq
They both generated proper code. The tiny difference in the number of bytes of instruction could have really gone either way with these small functions on this instruction set.
Your compiler at the time must not have seen that optimization for some reason.
mips:
00000000 <foo>:
0: 90820000 lbu v0,0(a0)
4: 00000000 nop
8: 00450018 mult v0,a1
c: 00001012 mflo v0
10: 03e00008 jr ra
14: 00000000 nop
00000018 <bar>:
18: 8c820000 lw v0,0(a0)
1c: 00000000 nop
20: 00a20018 mult a1,v0
24: 00001012 mflo v0
28: 03e00008 jr ra
2c: 00000000 nop
arm
00000000 <foo>:
0: e5d00000 ldrb r0, [r0]
4: e0000091 mul r0, r1, r0
8: e12fff1e bx lr
0000000c <bar>:
c: e5900000 ldr r0, [r0]
10: e0000091 mul r0, r1, r0
14: e12fff1e bx lr
No big surprise there. As on x86, the difference is in the load and how it deals with the other 24 bits. After that, as the code says, it promotes the unsigned char or int to a signed integer, multiplies, and returns a signed integer.
Another equally interesting example to complement your question.
struct foo { unsigned char x; };
struct bar { unsigned int x; };
char foo (const struct foo *x, char y) { return x->x * y; }
char bar (const struct bar *x, char y) { return x->x * y; }
clang
0000000000000000 <foo>:
0: 8a 07 mov (%rdi),%al
2: 40 f6 e6 mul %sil
5: 0f be c0 movsbl %al,%eax
8: c3 retq
0000000000000010 <bar>:
10: 0f af 37 imul (%rdi),%esi
13: 40 0f be c6 movsbl %sil,%eax
17: c3 retq
gcc
0000000000000000 <foo>:
0: 89 f0 mov %esi,%eax
2: f6 27 mulb (%rdi)
4: c3 retq
0000000000000010 <bar>:
10: 89 f0 mov %esi,%eax
12: f6 27 mulb (%rdi)
14: c3 retq
gcc arm
00000000 <foo>:
0: e5d00000 ldrb r0, [r0]
4: e0010190 mul r1, r0, r1
8: e20100ff and r0, r1, #255 ; 0xff
c: e12fff1e bx lr
00000010 <bar>:
10: e5900000 ldr r0, [r0]
14: e0010190 mul r1, r0, r1
18: e20100ff and r0, r1, #255 ; 0xff
1c: e12fff1e bx lr
mips
00000000 <foo>:
0: 90820000 lbu v0,0(a0)
4: 00052e00 sll a1,a1,0x18
8: 00052e03 sra a1,a1,0x18
c: 00a20018 mult a1,v0
10: 00001012 mflo v0
14: 00021600 sll v0,v0,0x18
18: 03e00008 jr ra
1c: 00021603 sra v0,v0,0x18
00000020 <bar>:
20: 8c820000 lw v0,0(a0)
24: 00052e00 sll a1,a1,0x18
28: 00052e03 sra a1,a1,0x18
2c: 00a20018 mult a1,v0
30: 00001012 mflo v0
34: 00021600 sll v0,v0,0x18
38: 03e00008 jr ra
3c: 00021603 sra v0,v0,0x18
That code in particular punished mips.
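The sll/sra pairs by 0x18 in the mips listing sign-extend the low byte of a 32-bit register, once for the incoming char argument and once for the char return value. A rough C model of one such pair follows; sign_extend_byte is a hypothetical name, and the sketch assumes that converting an out-of-range uint32_t to int32_t wraps and that >> on a negative int32_t is an arithmetic shift. Both are implementation-defined in C but hold for gcc and clang.

```c
#include <stdint.h>

/* Hypothetical model of "sll reg,reg,0x18" followed by "sra reg,reg,0x18". */
int32_t sign_extend_byte(uint32_t reg)
{
    uint32_t shifted = reg << 24;   /* sll: the low byte moves to the top */
    return (int32_t)shifted >> 24;  /* sra: bit 7 is copied back downward */
}
```

Under those assumptions, 0xff comes back as -1 and 0x7f as 127, and any bits above the low byte are discarded.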
and lastly
struct foo { unsigned char x; };
struct bar { unsigned int x; };
unsigned char foo (const struct foo *x, unsigned char y) { return x->x * y; }
unsigned char bar (const struct bar *x, unsigned char y) { return x->x * y; }
gcc and clang for x86 produce the same code as above with the non-specified chars, but
arm
00000000 <foo>:
0: e5d00000 ldrb r0, [r0]
4: e0010190 mul r1, r0, r1
8: e20100ff and r0, r1, #255 ; 0xff
c: e12fff1e bx lr
00000010 <bar>:
10: e5900000 ldr r0, [r0]
14: e0010190 mul r1, r0, r1
18: e20100ff and r0, r1, #255 ; 0xff
1c: e12fff1e bx lr
mips
00000000 <foo>:
0: 90820000 lbu v0,0(a0)
4: 30a500ff andi a1,a1,0xff
8: 00a20018 mult a1,v0
c: 00001012 mflo v0
10: 03e00008 jr ra
14: 304200ff andi v0,v0,0xff
00000018 <bar>:
18: 8c820000 lw v0,0(a0)
1c: 30a500ff andi a1,a1,0xff
20: 00a20018 mult a1,v0
24: 00001012 mflo v0
28: 03e00008 jr ra
2c: 304200ff andi v0,v0,0xff
The masking is needed because of a combination of calling convention and instruction set; it is a penalty on both of these instruction sets. You will see this often when using variables whose size does not match the register size on instruction sets like these. x86 has a much wider array of instruction choices; the cost for x86 is the power (watts) that the additional logic consumes.
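The and #255 and andi 0xff instructions are the conversion of the register-sized product back to unsigned char. A small sketch of the equivalence, assuming 8-bit bytes; mul_masked is a hypothetical name for the explicit form and does not appear in the original code.

```c
struct foo { unsigned char x; };

/* As in the example above: the product is computed as an int and then
   converted to unsigned char, which keeps only the low 8 bits. */
unsigned char foo(const struct foo *x, unsigned char y) { return x->x * y; }

/* Hypothetical explicit version of what the arm and mips code does:
   multiply in a full register, then mask with 0xff. */
unsigned int mul_masked(unsigned int a, unsigned int b)
{
    return (a * b) & 0xffu;
}
```

For example, 16 * 17 is 272, and both functions return 16 for those inputs. Returning a register-sized type instead would let the compiler drop the mask.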
For grins, even if you go way way back, the register sized choice is slightly cheaper.
00000000 <_foo>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 9f40 0004 movb *4(r5), r0
8: 45c0 ff00 bic $-400, r0
c: 1001 mov r0, r1
e: 7075 0006 mul 6(r5), r1
12: 1040 mov r1, r0
14: 1585 mov (sp)+, r5
16: 0087 rts pc
00000018 <_bar>:
18: 1166 mov r5, -(sp)
1a: 1185 mov sp, r5
1c: 1d41 0006 mov 6(r5), r1
20: 707d 0004 mul *4(r5), r1
24: 1040 mov r1, r0
26: 1585 mov (sp)+, r5
28: 0087 rts pc
compiler versions
gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
clang --version
clang version 3.4 (branches/release_34 201060)
Target: x86_64-unknown-linux-gnu
Thread model: posix
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
mips-elf-gcc --version
mips-elf-gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
That last instruction set is left as an exercise for the reader; there is a bit of a clue in the disassembly...