Parameter passing convention for Function call from ARM Assembly to C - c

I have a C code that calls a function defined in ARM Assembly. Two Parameters have to be passed.
If the function call looks like this:
functionName(a, b)
in which order do the registers x0 and x1 hold these values? Does x0 hold a and x1 hold b, or the other way round?

It took longer to ask the question than to just try it.
extern void bar ( unsigned int, unsigned int );
void foo ( void )
{
bar(5,7);
}
compile then disassemble
traditional arm
00000000 <foo>:
0: e3a00005 mov r0, #5
4: e3a01007 mov r1, #7
8: eafffffe b 0 <bar>
aarch64
0000000000000000 <foo>:
0: 528000e1 mov w1, #0x7 // #7
4: 528000a0 mov w0, #0x5 // #5
8: 14000000 b 0 <bar>
c: d503201f nop
msp430
00000000 <foo>:
0: 3e 40 07 00 mov #7, r14 ;#0x0007
4: 3f 40 05 00 mov #5, r15 ;#0x0005
8: b0 12 00 00 call #0x0000
c: 30 41 ret
pdp-11
00000000 <_foo>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 15e6 0007 mov $7, -(sp)
8: 15e6 0005 mov $5, -(sp)
c: 09f7 fff0 jsr pc, 0 <_foo>
10: 65c6 0004 add $4, sp
14: 1585 mov (sp)+, r5
16: 0087 rts pc
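Reading the aarch64 listing above: the first argument (5) lands in w0 and the second (7) in w1, so for functionName(a, b), x0 holds a and x1 holds b. A minimal sketch of the C side follows. It assumes a hypothetical stand-in, bar_standin, written in C in place of the real assembly routine so it can be built anywhere; the names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the routine that would be written in assembly.
   Under the AArch64 procedure call standard, `first` arrives in x0 and
   `second` arrives in x1 (w0/w1 when the values are 32 bits wide). */
static uint64_t seen_first, seen_second;

void bar_standin(uint64_t first, uint64_t second)
{
    seen_first = first;
    seen_second = second;
}

/* The caller: integer arguments are assigned to registers left to right. */
void foo(void)
{
    bar_standin(5, 7);
}
```

After foo() runs, seen_first is 5 and seen_second is 7, mirroring what the assembly routine would find in x0 and x1.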

Related

Difference between position dependent and position independent code?

I understand that current gcc compilers generate position independent code by default. However, to get an understanding of what position dependent code looks like, I compiled this
int Add(int x, int y) {
return x+y;
}
int Subtract(int x, int y) {
return x-y;
}
int main() {
bool flag = false;
int x=10,y=5,z;
if (flag) {
z = Add(x,y);
}
else {
z = Subtract(x,y);
}
}
as g++ -c check.cpp -no-pie. However, the generated code is identical with or without the -no-pie flag. <main+0x34> looks to be a relative offset.
26: 55 push %rbp
27: 48 89 e5 mov %rsp,%rbp
2a: 48 83 ec 10 sub $0x10,%rsp
2e: c6 45 f3 00 movb $0x0,-0xd(%rbp)
32: c7 45 f4 0a 00 00 00 movl $0xa,-0xc(%rbp)
39: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%rbp)
40: 80 7d f3 00 cmpb $0x0,-0xd(%rbp)
44: 74 14 je 5a <main+0x34>
46: 8b 55 f8 mov -0x8(%rbp),%edx
49: 8b 45 f4 mov -0xc(%rbp),%eax
4c: 89 d6 mov %edx,%esi
4e: 89 c7 mov %eax,%edi
50: e8 00 00 00 00 callq 55 <main+0x2f>
55: 89 45 fc mov %eax,-0x4(%rbp)
58: eb 12 jmp 6c <main+0x46>
5a: 8b 55 f8 mov -0x8(%rbp),%edx
5d: 8b 45 f4 mov -0xc(%rbp),%eax
60: 89 d6 mov %edx,%esi
62: 89 c7 mov %eax,%edi
64: e8 00 00 00 00 callq 69 <main+0x43>
69: 89 45 fc mov %eax,-0x4(%rbp)
6c: b8 00 00 00 00 mov $0x0,%eax
71: c9 leaveq
72: c3 retq
This is the objdump output in both cases, for main only. Am I using the wrong flag, or is the assembly supposed to be the same for PIC and non-PIC for this chunk of code? If it is supposed to be the same, could you please provide a snippet for which it isn't?
You have to access items that are outside the module or section to see a difference.
unsigned int x;
void fun ( void )
{
x = 5;
}
So this reaches from .text across to the data (here .bss, since x is uninitialized).
position dependent.
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000
position independent
00000000 <fun>:
0: e3a02005 mov r2, #5
4: e59f3010 ldr r3, [pc, #16] ; 1c <fun+0x1c>
8: e59f1010 ldr r1, [pc, #16] ; 20 <fun+0x20>
c: e08f3003 add r3, pc, r3
10: e7933001 ldr r3, [r3, r1]
14: e5832000 str r2, [r3]
18: e12fff1e bx lr
1c: 00000008
20: 00000000
In the first case the linker will fill in the address to the memory location
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000 <--- here
The pc-relative load at 4: reaches the word at 10:, which is within the .text section, so that part works whether the code is position dependent or independent.
4: e59f3004 ldr r3, [pc, #4] ; 10 <fun+0x10>
8: e5832000 str r2, [r3]
c: e12fff1e bx lr
10: 00000000
This code gets the address of the external entity, filled in by the linker, and then directly accesses the item at that address.
4: e59f3010 ldr r3, [pc, #16] ; 1c <fun+0x1c>
8: e59f1010 ldr r1, [pc, #16] ; 20 <fun+0x20>
c: e08f3003 add r3, pc, r3
10: e7933001 ldr r3, [r3, r1]
14: e5832000 str r2, [r3]
18: e12fff1e bx lr
1c: 00000008
20: 00000000
The position independent version above is easier to follow once linked (-Ttext=0x1000 -Tdata=0x2000):
00001000 <fun>:
1000: e3a02005 mov r2, #5
1004: e59f3010 ldr r3, [pc, #16] ; 101c <fun+0x1c>
1008: e59f1010 ldr r1, [pc, #16] ; 1020 <fun+0x20>
100c: e08f3003 add r3, pc, r3
1010: e7933001 ldr r3, [r3, r1]
1014: e5832000 str r2, [r3]
1018: e12fff1e bx lr
101c: 00010010
1020: 0000000c
Disassembly of section .got:
00011024 <_GLOBAL_OFFSET_TABLE_>:
...
11030: 00002000
Disassembly of section .bss:
00002000 <x>:
2000: 00000000
(clearly I should have also specified where the GOT goes).
While the global offset table and .bss are different sections, once linked they are fixed relative to each other. What position independence gives you is the ability to move .bss (or .data, etc.) relative to .text. Think about the position dependent solution: if .bss were to move and you had, say, 1000 references sprinkled all through the binary, you would have to patch every one of them.
Instead, the global offset table provides a single location where the address of the variable x lives, and every access to x uses double indirection. It may not be obvious, but a position dependent way to get at a table like this would be for the linker to fill in the table's absolute address. That would not be position independent, and this code was compiled to be position independent, so pc-relative math has to be done to find the global offset table. For this instruction set, when the instruction at 0x100c executes, the program counter reads as 0x100c+8.
100c: e08f3003 add r3, pc, r3
So we add 0x100c+8+0x00010010 = 0x11024, then add 0x0000000c to that, giving 0x11030. That is, compute the address of the GOT, then the offset within it, and the word stored THERE (0x2000) is the address of the item. The second indirection then goes through that address to get at the item.
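The arithmetic above can be checked mechanically. This is a minimal sketch in C; the function names are made up for illustration, and the constants are the ones from the linked disassembly.

```c
#include <stdint.h>

/* In ARM (A32) state, reading pc yields the address of the current
   instruction plus 8. */
uint32_t arm_pc_value(uint32_t insn_addr)
{
    return insn_addr + 8u;
}

/* Address of the GOT slot: the pc-relative base of the GOT plus the
   slot offset that was loaded from the second literal word. */
uint32_t got_slot_address(uint32_t add_insn_addr,
                          uint32_t got_delta,    /* word at 0x101c */
                          uint32_t slot_offset)  /* word at 0x1020 */
{
    uint32_t got_base = arm_pc_value(add_insn_addr) + got_delta;
    return got_base + slot_offset;
}
```

got_slot_address(0x100c, 0x00010010, 0x0000000c) comes out as 0x11030, the GOT entry that holds 0x2000, the address of x.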
If you were to place .text at an address other than 0x1000 but not move .bss, that is fine; this will all work so long as the GOT moves to the same relative offset from .text. If you were to leave .text but move .bss, then you have to update the GOT. If you move .bss from 0x2000 to 0x3000, that is a difference of +0x1000, so you go through the GOT and add 0x1000 to each item to cover that difference.
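The "go through the GOT and add the difference" step might look like the sketch below. It is an illustration only: relocate_got is a hypothetical helper, and it assumes every entry in the table points into the section that moved, which a real loader would have to check.

```c
#include <stddef.h>
#include <stdint.h>

/* Add `delta` to every address stored in a GOT-like table.
   Sketch only: assumes every entry points into the section that moved. */
void relocate_got(uint32_t *got, size_t entries, uint32_t delta)
{
    for (size_t i = 0; i < entries; i++) {
        got[i] += delta;
    }
}
```

With a one-entry table holding 0x2000 and a delta of 0x1000, the entry becomes 0x3000, and the code in .text is untouched.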
Position independence essentially has to do double indirection instead of single indirection (or one more level than would have been needed for position dependent) in order to access distant items or items that are not at a fixed position relative to .text. That means more code and more memory accesses, so it is both bigger and slower.
For it to work, .text reaching out to other .text items can't use fixed addresses; it has to use indirect/computed addresses. Likewise the GOT as used here (by GNU) has to be at a fixed relative position to .text. Then from there you can move data relative to code and still access it. So you have to have some rules. .text, being code and assumed read only, can't hold this offset table, which needs to be in RAM, so the table can't simply be built into the .text section.

Why do I get different result using log() function in C?

Here is a simple example of log() function test:
#include <stdio.h>
#include <math.h>
int main(void)
{
int a = 2;
printf("int a = %d, log((double)a) = %g, log(2.0) = %g\n", a, log((double)a), log(2.0));
return 0;
}
I get different results on Raspberry Pi 3 and Ubuntu 16.04:
arm-linux-gnueabi-gcc
$ arm-linux-gnueabi-gcc -mfloat-abi=soft -march=armv7-a foo.c -o foo -lm
$ ./foo
int a = 2, log((double)a) = 5.23028e-314, log(2.0) = 0.693147
arm-linux-gnueabihf-gcc
$ arm-linux-gnueabihf-gcc -march=armv7-a foo.c -o foo -lm
$ ./foo
int a = 2, log((double)a) = 0.693147, log(2.0) = 0.693147
gcc
$ gcc foo.c -o foo -lm
$ ./foo
int a = 2, log((double)a) = 0.693147, log(2.0) = 0.693147
The standard distribution of Raspbian uses the hardware floating point support of the Raspberry Pi (Raspbian FAQ) which is not fully compatible with the different approach of using a software library to emulate floating point computation using integers only.
You can tell the type of your Raspbian distribution by looking for the directory /lib/arm-linux-gnueabihf for the hard-float version and /lib/arm-linux-gnueabi (How can I tell...) for the soft-float one.
As Pascal Cuoq noted in one of the comments to this question, it might be of interest to know that the reason for the correct result of log(2.0) in all examples is called constant folding. The compiler is allowed to compute results at compile time, where possible, for optimization purposes. This might be unwanted behaviour if, for example, you use different rounding modes in your code. GCC has -frounding-math to switch off constant folding (among other things), although it might not catch everything, so be careful here.
Not able to repeat the issue. Where is your disassembly to show the value fed to printf?
#include <math.h>
double fun1 ( void )
{
return(log(2));
}
double fun2 ( void )
{
return(log(2.0));
}
00000000 <fun1>:
0: e30309ef movw r0, #14831 ; 0x39ef
4: e3021e42 movw r1, #11842 ; 0x2e42
8: e34f0efa movt r0, #65274 ; 0xfefa
c: e3431fe6 movt r1, #16358 ; 0x3fe6
10: e12fff1e bx lr
00000014 <fun2>:
14: e30309ef movw r0, #14831 ; 0x39ef
18: e3021e42 movw r1, #11842 ; 0x2e42
1c: e34f0efa movt r0, #65274 ; 0xfefa
20: e3431fe6 movt r1, #16358 ; 0x3fe6
24: e12fff1e bx lr
00000000 <fun1>:
0: ed9f 0b01 vldr d0, [pc, #4] ; 8 <fun1+0x8>
4: 4770 bx lr
6: bf00
8: fefa39ef
c: 3fe62e42
00000010 <fun2>:
10: ed9f 0b01 vldr d0, [pc, #4] ; 18 <fun2+0x8>
14: 4770 bx lr
16: bf00
18: fefa39ef
1c: 3fe62e42
0000000000000000 <fun1>:
0: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 8 <fun1+0x8>
7: 00
8: c3 retq
0000000000000010 <fun2>:
10: f2 0f 10 05 00 00 00 movsd 0x0(%rip),%xmm0 # 18 <fun2+0x8>
17: 00
18: c3 retq
0000000000000000 <.LC0>:
0: ef
1: 39 fa
3: fe 42 2e
6: e6 3f
The difference between (2) and (2.0) is an int to double conversion versus a constant that is already a double; you could also add in (2.0F). Whether this happens at compile time or at runtime can cause differences.
Start by eliminating the printf and divide this problem in half: is this a printf thing or not a printf thing? Then, is this a compile time thing or a runtime thing? A hard float thing or a soft float thing? A C library thing or not a C library thing?
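One way to take printf's floating point formatting out of the picture is to compare raw bit patterns instead. The sketch below uses made-up helper names and assumes double is a 64-bit IEEE 754 value. The constant that the movw/movt pairs above build (0x3fe62e42 high, 0xfefa39ef low) is the bit pattern of log(2.0).

```c
#include <stdint.h>
#include <string.h>

/* Return the raw bit pattern of a double.
   Assumes double is a 64-bit IEEE 754 binary64 value. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Join the two 32-bit halves built by the movw/movt pairs. */
uint64_t join_halves(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}
```

If double_bits() of the value returned by log((double)a) matches join_halves(0x3fe62e42, 0xfefa39ef), the math is fine and the problem is in how the value reaches printf.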
What if anything have you done so far to debug this?
Eventually someone is going to link the "what every programmer should know about floating point" article, whether it applies or not...
EDIT
#include <math.h>
double fun ( void )
{
return(log(2.0));
}
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: ec432b17 vmov d7, r2, r3
1c: eeb00b47 vmov.f64 d0, d7
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: e1a00002 mov r0, r2
1c: e1a01003 mov r1, r3
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
Well, there goes the notion of constant folding explaining why two calls to log() give vastly different results. (Arguably, with a different version of the toolchain or different command line arguments you could just get lucky; so far we don't know what toolchain versions, build options, etc. were used, so we can't repeat this.)
EDIT 2
#include <math.h>
double fun ( void )
{
return(log(2));
}
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: ec432b17 vmov d7, r2, r3
1c: eeb00b47 vmov.f64 d0, d7
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
00000000 <fun>:
0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
4: e28db000 add fp, sp, #0
8: e30329ef movw r2, #14831 ; 0x39ef
c: e34f2efa movt r2, #65274 ; 0xfefa
10: e3023e42 movw r3, #11842 ; 0x2e42
14: e3433fe6 movt r3, #16358 ; 0x3fe6
18: e1a00002 mov r0, r2
1c: e1a01003 mov r1, r3
20: e24bd000 sub sp, fp, #0
24: e49db004 pop {fp} ; (ldr fp, [sp], #4)
28: e12fff1e bx lr
Around 60 seconds' worth of work to contemplate constant folding maybe being a factor. So far it doesn't apply; there is potential dumb luck there, but the same dumb luck could/would apply to both calls to log.
A few seconds of work by the OP to disassemble that program would quickly cover this side topic.

Are C compilers like gcc smart enough to do bit shifts in place?

Unlike assembly code, in C there is no way to bit shift a value in place. To shift the bits in a variable, an assignment must always be performed:
x = x << 3;
Are compilers like gcc smart enough to realize that this is an in-place bit shift and compile it like this:
shl x, 3
or will the compiler first put the result in a register and then move it back into x (which would require two extra, unnecessary instructions)?
Any good compiler with optimization turned on will handle bit shifts efficiently.
Compilers will keep small objects in registers when feasible and efficient and will not store them to memory even if you write assignment statements, until they are forced to by circumstances.
Additionally, it is not desirable on typical modern processors to try to shift the bits of a value in memory. Generally, memory hardware does not have any capability to manipulate stored values. To change the value of something in memory, it must be moved to the processor (loaded), changed, and moved back (stored). Whether this is done in one instruction or several is not generally an indication of how fast or efficient it is, because the processor still has to do the individual load, shift, store operations, and the performance of those is highly dependent on the processor model.
Except in exceptional programming situations, you should not be worrying about performance at this level.
what did you see when you tried it? why not just try it?
unsigned int fun ( unsigned int x )
{
return (x<<3);
}
Disassembly of section .text:
00000000 <fun>:
0: e1a00180 lsl r0, r0, #3
4: e12fff1e bx lr
Disassembly of section .text:
00000000 <_fun>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 0cc0 asl r0
a: 0cc0 asl r0
c: 0cc0 asl r0
e: 1585 mov (sp)+, r5
10: 0087 rts pc
Disassembly of section .text:
0000000000000000 <fun>:
0: 531d7000 lsl w0, w0, #3
4: d65f03c0 ret
Disassembly of section .text:
0000000000000000 <fun>:
0: 8d 04 fd 00 00 00 00 lea 0x0(,%rdi,8),%eax
7: c3 retq
00000000 <fun>:
0: 42 18 0c 5c rpt #3 { rlax.w r12 ;
4: 30 41 ret
Disassembly of section .text:
00000000 <fun>:
0: 050e slli x10,x10,0x3
2: 8082 ret
unsigned int x;
void fun ( void )
{
x=x<<3;
}
Disassembly of section .text:
00000000 <fun>:
0: e59f200c ldr r2, [pc, #12] ; 14 <fun+0x14>
4: e5923000 ldr r3, [r2]
8: e1a03183 lsl r3, r3, #3
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
and so on
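As a side note to the question's premise, C does have a compound assignment operator, x <<= 3, which is defined to have the same effect as x = x << 3. A small sketch with made-up function names, which a compiler would normally turn into identical code:

```c
/* Plain assignment form, as written in the question. */
unsigned int shift_assign(unsigned int x)
{
    x = x << 3;
    return x;
}

/* Compound assignment form; same effect by definition of the operator. */
unsigned int shift_compound(unsigned int x)
{
    x <<= 3;
    return x;
}
```

Either spelling is a matter of style; what the generated code looks like depends on where x lives (register or memory) and the instruction set, as the listings above show.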

Bare metal programming Raspberry Pi 3.

I was going through some bare metal programming tutorials. While reading about C code execution, I learned that we need to set up a C execution environment: initializing the stack, zeroing .bss, etc.
In some cases you have to copy data into RAM, and need to provide startup code for that as well. The tutorial I linked says to copy data into RAM.
Now I have two questions.
If we need to copy data into RAM, then why don't we copy the code, i.e. the text segment? If we don't copy the text segment, does it mean the code is executed from the SD card itself in the case of the Raspberry Pi 3 (ARM embedded processor)?
When we specify a linker script like the one below, does it say to copy those sections into RAM, or that these sections will be mapped at RAM addresses?
Sorry, I am really confused.
MEMORY
{
ram : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.bss : { *(.bss*) } > ram
}
Any help is appreciated.
vectors.s
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
notmain.c
unsigned int x;
unsigned int y=0x12345678;
void notmain ( void )
{
x=y+7;
}
memmap
MEMORY
{
bob : ORIGIN = 0x80000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
.data : { *(.data*) } > ted
}
build
arm-none-eabi-as --warn --fatal-warnings vectors.s -o vectors.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -c notmain.c -o notmain.o
arm-none-eabi-ld vectors.o notmain.o -T memmap -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy notmain.elf -O binary kernel.img
You can add/remove options, and name it the right kernelX.img (and if you are venturing into 64 bit then use aarch64-whatever-gcc instead of arm-whatever-gcc...).
Looking at the disassembly
Disassembly of section .text:
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000000 bl 800c <notmain>
8008: eafffffe b 8008 <_start+0x8>
0000800c <notmain>:
800c: e59f3010 ldr r3, [pc, #16] ; 8024 <notmain+0x18>
8010: e5933000 ldr r3, [r3]
8014: e59f200c ldr r2, [pc, #12] ; 8028 <notmain+0x1c>
8018: e2833007 add r3, r3, #7
801c: e5823000 str r3, [r2]
8020: e12fff1e bx lr
8024: 00008030 andeq r8, r0, r0, lsr r0
8028: 0000802c andeq r8, r0, r12, lsr #32
Disassembly of section .bss:
0000802c <x>:
802c: 00000000 andeq r0, r0, r0
Disassembly of section .data:
00008030 <y>:
8030: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
and comparing that to the kernelX.img file
hexdump -C kernel.img
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 30 80 00 00 2c 80 00 00 00 00 00 00 |../.0...,.......|
00000030 78 56 34 12 |xV4.|
00000034
Note that because I put .data after .bss in the linker script, it put them in that order in the image. There are four bytes of zeros (the .bss word for x) between the last word of .text and the 0x12345678 of .data.
If you swap the positions of .bss and .data in the linker script
0000802c <y>:
802c: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
Disassembly of section .bss:
00008030 <x>:
8030: 00000000 andeq r0, r0, r0
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 2c 80 00 00 30 80 00 00 78 56 34 12 |../.,...0...xV4.|
00000030
Ooops, no freebie. Now .bss is not zeroed and you would need to zero it in your bootstrap (if you have a .bss area and as a programming style you assume those items are zero when you first use them).
Okay so how do you find where .bss is? well that is what the tutorial and countless others are showing you.
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
linker_stuff:
.word hello_world
.word world_hello
MEMORY
{
bob : ORIGIN = 0x80000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.data : { *(.data*) } > ted
hello_world = .;
.bss : { *(.bss*) } > ted
world_hello = .;
}
build and disassemble
Disassembly of section .text:
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000002 bl 8014 <notmain>
8008: eafffffe b 8008 <_start+0x8>
0000800c <linker_stuff>:
800c: 00008038 andeq r8, r0, r8, lsr r0
8010: 0000803c andeq r8, r0, r12, lsr r0
00008014 <notmain>:
8014: e59f3010 ldr r3, [pc, #16] ; 802c <notmain+0x18>
8018: e5933000 ldr r3, [r3]
801c: e59f200c ldr r2, [pc, #12] ; 8030 <notmain+0x1c>
8020: e2833007 add r3, r3, #7
8024: e5823000 str r3, [r2]
8028: e12fff1e bx lr
802c: 00008034 andeq r8, r0, r4, lsr r0
8030: 00008038 andeq r8, r0, r8, lsr r0
Disassembly of section .data:
00008034 <y>:
8034: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
Disassembly of section .bss:
00008038 <x>:
8038: 00000000 andeq r0, r0, r0
So, digging more into toolchain specific stuff, we now know the start and end of .bss, or we can use math in the linker script to get the size. From that you can write a small loop that zeros that memory (in assembly language of course, chicken and egg, in the bootstrap before you branch to the C entry point of your program).
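The real loop belongs in the assembly bootstrap, as said above, but the logic is small enough to sketch in C against an ordinary array so it can be tried on a hosted system. zero_words is a hypothetical name; on the target, start and end would come from linker symbols such as hello_world and world_hello.

```c
#include <stdint.h>

/* Zero every word in [start, end). Here start/end are ordinary pointers
   into one array; in a bootstrap they would be the linker-provided
   start and end of .bss. */
void zero_words(uint32_t *start, uint32_t *end)
{
    while (start < end) {
        *start++ = 0;
    }
}
```

The assembly version is the same shape: load the two addresses, store zero, bump the pointer, compare, and branch back.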
Now say for some reason you wanted .data at some other address 0x10000000
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
MEMORY
{
bob : ORIGIN = 0x10000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
.data : { *(.data*) } > bob
}
00008000 <_start>:
8000: e3a0d902 mov sp, #32768 ; 0x8000
8004: eb000000 bl 800c <notmain>
8008: eafffffe b 8008 <_start+0x8>
0000800c <notmain>:
800c: e59f3010 ldr r3, [pc, #16] ; 8024 <notmain+0x18>
8010: e5933000 ldr r3, [r3]
8014: e59f200c ldr r2, [pc, #12] ; 8028 <notmain+0x1c>
8018: e2833007 add r3, r3, #7
801c: e5823000 str r3, [r2]
8020: e12fff1e bx lr
8024: 10000000 andne r0, r0, r0
8028: 0000802c andeq r8, r0, r12, lsr #32
Disassembly of section .bss:
0000802c <x>:
802c: 00000000 andeq r0, r0, r0
Disassembly of section .data:
10000000 <y>:
10000000: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
so what is the kernel.img or -O binary format? it is just a memory image starting at the lowest address (0x8000 in this case) and filled OR PADDED to the highest address, in this case 0x10000003, so it is a 0x10000004-0x8000 byte file.
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 00 00 00 10 2c 80 00 00 00 00 00 00 |../.....,.......|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
0fff8000 78 56 34 12 |xV4.|
0fff8004
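The file size follows directly from the two addresses. As a tiny worked check in C (flat_image_size is a made-up helper, and the addresses are the ones from this example):

```c
#include <stdint.h>

/* Size of a flat binary image: from the lowest loaded address up to
   (but not including) the address just past the highest loaded byte. */
uint32_t flat_image_size(uint32_t lowest_addr, uint32_t end_addr)
{
    return end_addr - lowest_addr;
}
```

flat_image_size(0x8000, 0x10000004) gives 0x0fff8004, which matches the final offset in the hexdump: roughly 256 MiB of file to carry 0x34 bytes of real content.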
That is a massive waste of disk space for this program, they padded the hell out of that. Now if for some reason you wanted to do something like this, various reasons (that generally do not apply to bare metal on the pi), you could do this instead:
MEMORY
{
bob : ORIGIN = 0x10000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
.data : { *(.data*) } > bob AT > ted
}
00000000 02 d9 a0 e3 00 00 00 eb fe ff ff ea 10 30 9f e5 |.............0..|
00000010 00 30 93 e5 0c 20 9f e5 07 30 83 e2 00 30 82 e5 |.0... ...0...0..|
00000020 1e ff 2f e1 00 00 00 10 2c 80 00 00 00 00 00 00 |../.....,.......|
00000030 78 56 34 12 |xV4.|
00000034
Disassembly of section .bss:
0000802c <x>:
802c: 00000000 andeq r0, r0, r0
Disassembly of section .data:
10000000 <y>:
10000000: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
What this has done: the code is compiled and linked for .data at 0x10000000, but the binary that you carry around and load has the .data contents bundled up tight. It is the job of the bootstrap to copy that data to its correct landing spot at 0x10000000, and again you have to use toolchain specific linker script stuff:
.globl _start
_start:
mov sp,#0x8000
bl notmain
b .
linker_stuff:
.word data_start
.word data_end
MEMORY
{
bob : ORIGIN = 0x10000000, LENGTH = 0x1000
ted : ORIGIN = 0x8000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ted
.rodata : { *(.rodata*) } > ted
.bss : { *(.bss*) } > ted
data_start = .;
.data : { *(.data*) } > bob AT > ted
data_end = .;
}
0000800c <linker_stuff>:
800c: 00008038 andeq r8, r0, r8, lsr r0
8010: 10000004 andne r0, r0, r4
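And clearly that didn't quite work (data_start came out as 0x8038, an address in ted, while data_end came out as 0x10000004, an address in bob), so you have to do more linker script work to figure it out.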
There is no good reason to need any of this for the Raspberry Pi. At best, if you have .bss and don't have any .data, and/or you put .bss last if you have a lot of it, then you can either take advantage of the toolchain accidentally zero padding and solving the .bss problem for you or, if that makes too big a binary, you can see above how to find the .bss offset and size and then add the few lines of code to zero it (ultimately costing load time either way, but not costing SD card space).
Where you definitely need to learn such things is on a microcontroller, where the non-volatile storage is treated as read-only flash. If you choose to program in a style that requires .data and/or .bss, and you assume those items are initialized, then you have to do the toolchain specific work to link and then zero and/or copy from non-volatile flash to read/write RAM before branching into the first or only C entry point of your application.
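The copy half of that bootstrap work is just as small as the zeroing half. Below is a sketch in C with a hypothetical copy_words helper, run here on plain arrays; on a microcontroller the source would be the load location of .data in flash and the destination its run location in RAM, both taken from linker symbols.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `words` 32-bit words from the load location (e.g. flash)
   to the run location (e.g. RAM). */
void copy_words(uint32_t *run_addr, const uint32_t *load_addr, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        run_addr[i] = load_addr[i];
    }
}
```

In practice this is usually written in the assembly bootstrap for the same chicken and egg reason as the .bss loop: C code that relies on .data can't run before .data is in place.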
I am sure someone could come up with reasons to not pack a Pi bare metal binary up nice and neat; there is always an exception... but for now you don't need to worry about those exceptions. Put .bss first, then .data, and always make sure you have a .data item even if it is something you never use.

Why will GCC copy word into the return register but not byte?

Is there a logical reason GCC (4.4.7) is not moving the byte from a structure into %eax directly, or is it just an optimization oversight?
Consider the following program:
struct foo { unsigned char x; };
struct bar { unsigned int x; };
int foo (const struct foo *x, int y) { return x->x * y; }
int bar (const struct bar *x, int y) { return x->x * y; }
When compiling with GCC, foo() and bar() differ more substantially than I expected:
foo:
.LFB0:
.cfi_startproc
movzbl (%rdi), %edx
movl %esi, %eax
imull %edx, %eax
ret
.cfi_endproc
bar:
.LFB1:
.cfi_startproc
movl (%rdi), %eax
imull %esi, %eax
ret
.cfi_endproc
I expected foo() would be just like bar(), except using a different move instruction.
I will note that under clang-500.2.79, the compiler generates the code I expect for foo(), and foo() and bar() have the same number of instructions (as I had expected for GCC as well, but was wrong).
Since you multiply an unsigned char x by an int y in the function foo, the compiler must first promote x to int, which is exactly what the movzbl instruction does.
See the explanation of movz instructions here.
Afterward I recompiled your code with gcc 4.6.1 and -O3, and got the following assembly:
foo:
.LFB34:
.cfi_startproc
movzbl (%rdi), %eax
imull %esi, %eax
ret
.cfi_endproc
bar:
.LFB35:
.cfi_startproc
movl (%rdi), %eax
imull %esi, %eax
ret
.cfi_endproc
It doesn't use %edx any more.
The short answer
Why will GCC copy word into the return register but not byte?
Because you asked it to return a word not a byte. The compilers did what they were asked based on your code. You asked for a size promotion in one case and unsigned to signed in both cases. There was more than one way to do that and clang/llvm and gcc happened to vary.
Is there a logical reason GCC (4.4.7) is not moving the byte from a structure into %eax directly, or is it just an optimization oversight?
I think based on what we see in the current compilers it was an oversight. See generated code below. (-O2 used in each case).
Interesting experiments related to this question.
clang
0000000000000000 <foo>:
0: 0f b6 07 movzbl (%rdi),%eax
3: 0f af c6 imul %esi,%eax
6: c3 retq
0000000000000010 <bar>:
10: 0f af 37 imul (%rdi),%esi
13: 89 f0 mov %esi,%eax
15: c3 retq
gcc
0000000000000000 <foo>:
0: 0f b6 07 movzbl (%rdi),%eax
3: 0f af c6 imul %esi,%eax
6: c3 retq
0000000000000010 <bar>:
10: 8b 07 mov (%rdi),%eax
12: 0f af c6 imul %esi,%eax
15: c3 retq
They both generated proper code. The tiny difference in the number of bytes of instruction could have really gone either way with these small functions on this instruction set.
Your compiler at the time must not have seen that optimization for some reason.
mips:
00000000 <foo>:
0: 90820000 lbu v0,0(a0)
4: 00000000 nop
8: 00450018 mult v0,a1
c: 00001012 mflo v0
10: 03e00008 jr ra
14: 00000000 nop
00000018 <bar>:
18: 8c820000 lw v0,0(a0)
1c: 00000000 nop
20: 00a20018 mult a1,v0
24: 00001012 mflo v0
28: 03e00008 jr ra
2c: 00000000 nop
arm
00000000 <foo>:
0: e5d00000 ldrb r0, [r0]
4: e0000091 mul r0, r1, r0
8: e12fff1e bx lr
0000000c <bar>:
c: e5900000 ldr r0, [r0]
10: e0000091 mul r0, r1, r0
14: e12fff1e bx lr
No big surprise there. As with x86, the difference is in the load and how it deals with the other 24 bits. Then, as the code says, it takes the unsigned char or unsigned int to a signed integer, multiplies, and returns a signed integer.
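The conversions the compilers are implementing can be spelled out in the source. This sketch repeats the two functions from the question and only adds comments on what the language rules require; nothing here is specific to one compiler.

```c
struct foo { unsigned char x; };
struct bar { unsigned int x; };

/* x->x is promoted from unsigned char to int, so this is an
   int * int multiply; the zero-extending byte load does the promotion. */
int foo(const struct foo *x, int y) { return x->x * y; }

/* Here y is converted to unsigned int, the multiply is unsigned,
   and the result is converted back to int on return. */
int bar(const struct bar *x, int y) { return x->x * y; }
```

For example, with x->x equal to 200, foo(&f, -2) returns -400: the byte is widened to the int 200 before the multiply.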
Another equally interesting example to complement your question.
struct foo { unsigned char x; };
struct bar { unsigned int x; };
char foo (const struct foo *x, char y) { return x->x * y; }
char bar (const struct bar *x, char y) { return x->x * y; }
clang
0000000000000000 <foo>:
0: 8a 07 mov (%rdi),%al
2: 40 f6 e6 mul %sil
5: 0f be c0 movsbl %al,%eax
8: c3 retq
0000000000000010 <bar>:
10: 0f af 37 imul (%rdi),%esi
13: 40 0f be c6 movsbl %sil,%eax
17: c3 retq
gcc
0000000000000000 <foo>:
0: 89 f0 mov %esi,%eax
2: f6 27 mulb (%rdi)
4: c3 retq
0000000000000010 <bar>:
10: 89 f0 mov %esi,%eax
12: f6 27 mulb (%rdi)
14: c3 retq
gcc arm
00000000 <foo>:
0: e5d00000 ldrb r0, [r0]
4: e0010190 mul r1, r0, r1
8: e20100ff and r0, r1, #255 ; 0xff
c: e12fff1e bx lr
00000010 <bar>:
10: e5900000 ldr r0, [r0]
14: e0010190 mul r1, r0, r1
18: e20100ff and r0, r1, #255 ; 0xff
1c: e12fff1e bx lr
mips
00000000 <foo>:
0: 90820000 lbu v0,0(a0)
4: 00052e00 sll a1,a1,0x18
8: 00052e03 sra a1,a1,0x18
c: 00a20018 mult a1,v0
10: 00001012 mflo v0
14: 00021600 sll v0,v0,0x18
18: 03e00008 jr ra
1c: 00021603 sra v0,v0,0x18
00000020 <bar>:
20: 8c820000 lw v0,0(a0)
24: 00052e00 sll a1,a1,0x18
28: 00052e03 sra a1,a1,0x18
2c: 00a20018 mult a1,v0
30: 00001012 mflo v0
34: 00021600 sll v0,v0,0x18
38: 03e00008 jr ra
3c: 00021603 sra v0,v0,0x18
That code in particular punished mips.
and lastly
struct foo { unsigned char x; };
struct bar { unsigned int x; };
unsigned char foo (const struct foo *x, unsigned char y) { return x->x * y; }
unsigned char bar (const struct bar *x, unsigned char y) { return x->x * y; }
gcc and clang for x86 produce the same code as above with the non-specified chars, but
arm
00000000 <foo>:
0: e5d00000 ldrb r0, [r0]
4: e0010190 mul r1, r0, r1
8: e20100ff and r0, r1, #255 ; 0xff
c: e12fff1e bx lr
00000010 <bar>:
10: e5900000 ldr r0, [r0]
14: e0010190 mul r1, r0, r1
18: e20100ff and r0, r1, #255 ; 0xff
1c: e12fff1e bx lr
mips
00000000 <foo>:
0: 90820000 lbu v0,0(a0)
4: 30a500ff andi a1,a1,0xff
8: 00a20018 mult a1,v0
c: 00001012 mflo v0
10: 03e00008 jr ra
14: 304200ff andi v0,v0,0xff
00000018 <bar>:
18: 8c820000 lw v0,0(a0)
1c: 30a500ff andi a1,a1,0xff
20: 00a20018 mult a1,v0
24: 00001012 mflo v0
28: 03e00008 jr ra
2c: 304200ff andi v0,v0,0xff
Masking is needed because of a combination of calling convention and instruction set. It is a punishment on both of these instruction sets... You will see this often when using variables whose size does not match the register size on instruction sets like these. x86 has a much wider array of instruction choices; the cost for x86 is the power (watts) that the additional logic consumes.
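What the and #255 / andi 0xff instructions implement can be written out by hand in C. This is a sketch with made-up function names, assuming 8-bit chars: the multiply happens in int, and returning unsigned char reduces the product modulo 256.

```c
/* Return type is unsigned char, so the int product is reduced
   modulo 256 on return; this is what the `and ..., #255` does. */
unsigned char mul_uchar(unsigned char a, unsigned char b)
{
    return a * b;
}

/* The same thing with the mask written out by hand. */
unsigned char mul_uchar_masked(unsigned char a, unsigned char b)
{
    return (unsigned char)((a * b) & 0xff);
}
```

For example, 200 * 3 is 600 as an int, and 600 modulo 256 is 88, so both functions return 88.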
For grins, even if you go way way back, the register sized choice is slightly cheaper.
00000000 <_foo>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 9f40 0004 movb *4(r5), r0
8: 45c0 ff00 bic $-400, r0
c: 1001 mov r0, r1
e: 7075 0006 mul 6(r5), r1
12: 1040 mov r1, r0
14: 1585 mov (sp)+, r5
16: 0087 rts pc
00000018 <_bar>:
18: 1166 mov r5, -(sp)
1a: 1185 mov sp, r5
1c: 1d41 0006 mov 6(r5), r1
20: 707d 0004 mul *4(r5), r1
24: 1040 mov r1, r0
26: 1585 mov (sp)+, r5
28: 0087 rts pc
compiler versions
gcc --version
gcc (Ubuntu/Linaro 4.8.1-10ubuntu9) 4.8.1
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
clang --version
clang version 3.4 (branches/release_34 201060)
Target: x86_64-unknown-linux-gnu
Thread model: posix
arm-none-eabi-gcc --version
arm-none-eabi-gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
mips-elf-gcc --version
mips-elf-gcc (GCC) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
And that last instruction set is an exercise for the reader, there is a bit of a clue in the disassembly...

Resources