I'm compiling some test code for an embedded ARM chip using a test C file.
test.c is as follows:
int main(){
*(int*)0xFFFFF400 = 7;
}
I compile the file with the following command:
arm-none-eabi-gcc -march=armv4 -mtune=arm7tdmi -specs=nosys.specs -Wall -o test test.c
It compiles without complaint. I then examine the assembly with:
arm-none-eabi-objdump -d ./test
This produces a long output that includes the following main() section:
00008018 <main>:
8018: e3e03000 mvn r3, #0
801c: e3a02007 mov r2, #7
8020: e3a00000 mov r0, #0
8024: e5032bff str r2, [r3, #-3071] ; 0xfffff401
8028: e1a0f00e mov pc, lr
Why does it say 0xfffff401 instead of 0xfffff400? Why is it subtracting 3071 instead of 3072?
The mvn instruction writes the bitwise inverse of its operand to a register. The bitwise inverse of 0 is all 1 bits, which, in two’s complement, represents −1. Then the address [r3, #-3071] is −1 + −3071 = −3072.
I do not know why the compiler is choosing to base its addressing off −1 rather than 0.
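The arithmetic can be checked with a few lines of C. The sketch below also tests one plausible explanation, which is an assumption on my part and not something confirmed from GCC's sources: an A32 data-processing immediate has to be an 8-bit value rotated right by an even amount, so neither 0xFFFFF400 nor its complement 0xBFF can be produced by a single mov or mvn, whereas 0xFFFFFFFF (mvn of 0) can, and it lies within the 12-bit offset range of str. The helper names effective_address and is_arm_immediate are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Effective address of "str rX, [base, #offset]" with 32-bit wraparound.
   The A32 immediate-offset form only has a 12-bit offset magnitude. */
uint32_t effective_address(uint32_t base, int32_t offset)
{
    assert(offset > -4096 && offset < 4096);
    return base + (uint32_t)offset;
}

/* Returns 1 if value is an 8-bit constant rotated right by an even
   amount, i.e. usable as an A32 mov/mvn immediate operand. */
int is_arm_immediate(uint32_t value)
{
    for (unsigned rot = 0; rot < 32; rot += 2) {
        /* Rotating left by rot undoes a rotate-right by rot. */
        uint32_t undone = (value << rot) | (value >> ((32 - rot) & 31));
        if (undone <= 0xFFu)
            return 1;
    }
    return 0;
}
```

Here effective_address(0xFFFFFFFF, -3071) gives 0xFFFFF400, matching the store in the listing, while is_arm_immediate rejects both 0xFFFFF400 and 0xBFF and accepts 0, the operand of mvn r3, #0.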
(I am new to the ARM world. Excuse me if this is a dumb question.)
I am using below command line to generate assembly code for a C file.
The cpu is arm926ej-s, which is ARMv5 architecture.
arm-none-eabi-gcc -mcpu=arm926ej-s -mthumb -S t.c -o t_thumb.S
arm-none-eabi-gcc -mcpu=arm926ej-s -marm -S t.c -o t_arm.S
I am expecting the -marm and -mthumb options would generate different function prologues. But they give similar results:
for -marm:
# args = 0, pretend = 0, frame = 72
# frame_needed = 1, uses_anonymous_args = 0
push {fp, lr} #<========== push is used instead of stmfd
add fp, sp, #4
sub sp, sp, #72
bl uart_init
for -mthumb:
# args = 0, pretend = 0, frame = 72
# frame_needed = 1, uses_anonymous_args = 0
push {r7, lr} #<========== push is used as expected
sub sp, sp, #72
add r7, sp, #0
bl uart_init
So they both use the push instruction. But as I checked the ARMv5 arch spec, the push instruction only belongs to the Thumb instruction set. I was expecting stmfd for the -marm option.
Why is push chosen instead?
How can I generate pure ARM instructions?
ADD 1 - 5:21 PM 12/18/2019
Below is the disassembly of the .o file:
arm-none-eabi-gcc -mcpu=arm926ej-s -marm -g -c t.c -o build/t_arm.o
arm-none-eabi-objdump.exe -d build/t_arm.o > t_arm.dism
The disassembly:
000002a0 <main>:
2a0: e92d4800 push {fp, lr} <=============== push is used!
2a4: e28db004 add fp, sp, #4
2a8: e24dd048 sub sp, sp, #72 ; 0x48
2ac: ebfffffe bl 0 <uart_init>
2b0: e59f3168 ldr r3, [pc, #360] ; 420 <main+0x180>
2b4: e50b300c str r3, [fp, #-12]
2b8: e59f1164 ldr r1, [pc, #356] ; 424 <main+0x184>
2bc: e51b000c ldr r0, [fp, #-12]
ADD 2 - 5:34 PM 12/18/2019
Thanks to #Erlkoenig.
I just tried to disassemble a -mthumb binary:
arm-none-eabi-gcc -mcpu=arm926ej-s -mthumb -g -c t.c -o build/t_thumb.o
arm-none-eabi-objdump.exe -d build/t_thumb.o > t_thumb.dism
A totally different thumb disassembly is shown:
00000170 <main>:
170: b580 push {r7, lr} <====== though still push is shown, but the encoding is different.
172: b092 sub sp, #72 ; 0x48
174: af00 add r7, sp, #0
176: f7ff fffe bl 0 <uart_init>
17a: 4b3c ldr r3, [pc, #240] ; (26c <main+0xfc>)
17c: 643b str r3, [r7, #64] ; 0x40
17e: 4a3c ldr r2, [pc, #240] ; (270 <main+0x100>)
180: 6c3b ldr r3, [r7, #64] ; 0x40
The hex encoding of the raw instruction as shown by objdump -d indicates that this is a 32-bit ARM ("A32") instruction (0xe92d4800). The .S file generated by GCC's -S flag and the objdump output both use ARM's UAL (Unified Assembler Language), which uses push as an alias for stmfd with sp as the base register and writeback, while the ARMv5T Architecture Reference Manual uses the old syntax, which has no push on A32. The instruction encoding matches the encoding of stmdb, for which stmfd is an alias. The encoding is shown on p. 339 in the ARMv5T Reference Manual.
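To make the stmdb match concrete, here is a minimal sketch that pulls the load/store-multiple fields out of an instruction word. The struct and function names are hypothetical; the bit positions are the standard A32 LDM/STM layout.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an A32 load/store-multiple instruction (LDM/STM family). */
struct block_transfer {
    unsigned cond;     /* bits 31..28: condition, 0xE means AL (always) */
    unsigned p;        /* bit 24: 1 = adjust the address before each transfer */
    unsigned u;        /* bit 23: 1 = increment, 0 = decrement */
    unsigned s;        /* bit 22 */
    unsigned w;        /* bit 21: 1 = write the final address back to Rn */
    unsigned l;        /* bit 20: 1 = load (LDM), 0 = store (STM) */
    unsigned rn;       /* bits 19..16: base register */
    unsigned reglist;  /* bits 15..0: one bit per register r0..r15 */
};

struct block_transfer decode_block_transfer(uint32_t insn)
{
    struct block_transfer f;

    /* Bits 27..25 are 0b100 for every LDM/STM encoding. */
    assert(((insn >> 25) & 7u) == 4u);
    f.cond = insn >> 28;
    f.p = (insn >> 24) & 1u;
    f.u = (insn >> 23) & 1u;
    f.s = (insn >> 22) & 1u;
    f.w = (insn >> 21) & 1u;
    f.l = (insn >> 20) & 1u;
    f.rn = (insn >> 16) & 0xFu;
    f.reglist = insn & 0xFFFFu;
    return f;
}

/* UAL "push {...}" is STMDB sp!, {...}: P=1, U=0, W=1, L=0, Rn=13. */
int is_push(uint32_t insn)
{
    struct block_transfer f = decode_block_transfer(insn);
    return f.p == 1 && f.u == 0 && f.s == 0 && f.w == 1 && f.l == 0 && f.rn == 13;
}
```

For 0xe92d4800 this reports a decrement-before store with writeback on r13 (sp) and a register list with bits 11 (fp) and 14 (lr) set, which is exactly push {fp, lr}.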
A32 ("ARM") code can be easily recognized as all instructions are 4 bytes wide and the first 4 bits are often hex E (which means that the condition code is AL, i.e. the instruction is executed unconditionally):
[e]92d4800
[e]28db004
[e]24dd048
[e]bfffffe
This is helpful when viewing raw binaries in a hex editor. Thumb ("T32") code has many 16bit instructions, some 32bit, and no "stacks" of Es:
b580
b092
af00
f7ff fffe
Of course, for a raw binary, it is not directly clear which 2- and 4-byte groups belong together as instructions.
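The "stack of Es" observation can be turned into a rough heuristic. This is only a sketch: count_cond_al is a made-up name, and it assumes the bytes have already been assembled into 32-bit words in the target's byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts the words whose top nibble is 0xE, the A32 "always" condition.
   A high share of hits suggests A32 code; it proves nothing on its own. */
size_t count_cond_al(const uint32_t *words, size_t n)
{
    size_t hits = 0;

    assert(words != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if ((words[i] >> 28) == 0xEu)
            hits++;
    return hits;
}
```

Fed the four A32 words listed above it returns 4. The Thumb bytes, read as little-endian words 0xb092b580 and 0xf7ffaf00, give 0.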
I'm trying to manually issue ARMv7 movt and movw instructions for a cpu feature test. I'm catching a compile error with Clang.
The test program is below. According to the ARM folks, .inst.w is the way to do this. It handles big-endian and little-endian properly, and places the code in the .text section instead of a data section.
$ cat test.cxx
int test()
{
int a;
asm volatile (
".inst.w 0xf2412334 \n\t" // movw r3, 0x1234
".inst.w 0xf2c12334 \n\t" // movt r3, 0x1234
"mov %0, r3 \n\t" // mov [a], r3
: "=r" (a) : : "r3");
return a;
}
GCC is fine:
$ g++ -O1 -march=armv7-a test.cxx -c
$ objdump --disassemble test.o
...
00000000 <_Z4testv>:
0: f241 2334 movw r3, #4660 ; 0x1234
4: f2c1 2334 movt r3, #4660 ; 0x1234
8: 4618 mov r0, r3
a: 4770 bx lr
However, Clang:
$ clang++ -O1 -march=armv7-a test.cxx -c
test.cxx:5:2: error: width suffixes are invalid in ARM mode
".inst.w 0xf2412334 \n\t" // movw r3, 0x1234
^
<inline asm>:1:2: note: instantiated into assembly here
.inst.w 0xf2412334
^
test.cxx:5:25: error: width suffixes are invalid in ARM mode
".inst.w 0xf2412334 \n\t" // movw r3, 0x1234
^
<inline asm>:2:2: note: instantiated into assembly here
.inst.w 0xf2c12334
^
2 errors generated.
If I change .inst.w to .inst, then Clang produces garbage:
$ clang++ -O1 -march=armv7-a test.cxx -c
$ objdump --disassemble test.o
...
00000000 <_Z4testv>:
0: f2412334 vcge.s8 d18, d1, d20
4: f2c12334 vbic.i32 d18, #5120 ; 0x00001400
8: e1a00003 mov r0, r3
c: e12fff1e bx lr
I verified Clang is defining __GNUC__, so it should be able to consume this code.
How do I get Clang to assemble the movt and movw instructions?
The main difference is that your GCC is configured to default to thumb mode, while clang isn't.
ARM has got two different 32 bit instruction sets, ARM and Thumb, and even if the instruction names are similar, the encodings are different. The ARM instruction set encodes all instructions as fixed length 32 bit instructions, while Thumb originally was a much smaller instruction set with all instructions being 16 bit. Since Thumb2 (which is the case for ARMv7), the instructions can either be a single 16 bit instruction or a pair of two 16 bit instructions.
The disassembly you showed indicates this:
0: f241 2334 movw r3, #4660 ; 0x1234
4: f2c1 2334 movt r3, #4660 ; 0x1234
8: 4618 mov r0, r3
a: 4770 bx lr
The latter two instructions are plain 16 bit opcodes (4618 and 4770), while the former two are two pairs of 16 bits (f241 2334 and f2c1 2334) separated with whitespace.
The clang disassembly, however, doesn't split the opcodes in half, and has full 32 bit opcodes for all instructions:
0: f2412334 vcge.s8 d18, d1, d20
4: f2c12334 vbic.i32 d18, #5120 ; 0x00001400
8: e1a00003 mov r0, r3
c: e12fff1e bx lr
In this case, passing -mthumb to Clang should get the same behaviour as GCC, and vice versa, passing -marm to GCC should reproduce the same failure there.
The .w suffix to .inst is to indicate that the value should be handled as a wide 32 bit instruction (as opposed to a narrow 16 bit one), which only makes sense in Thumb mode. IIRC, both GCC (since some time) and Clang (since release 8) should be able to deduce the kind of Thumb instruction without the .w suffix as well.
Instead of forcing the compiler to one mode or another, you probably want something like this instead though:
asm volatile (
#ifdef __thumb__
".inst.w 0xf2412334 \n\t" // movw r3, 0x1234
".inst.w 0xf2c12334 \n\t" // movt r3, 0x1234
#else
".inst 0xe3013234 \n\t" // movw r3, 0x1234
".inst 0xe3413234 \n\t" // movt r3, 0x1234
#endif
"mov %0, r3 \n\t" // mov [a], r3
: "=r" (a) : : "r3");
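If you want to double-check the four constants rather than trust them, the encodings can be rebuilt from the register number and the 16-bit immediate. This is a sketch with hypothetical helper names; the field layouts are the A32 MOVW/MOVT encodings and the Thumb-2 32-bit MOVW/MOVT encodings, where the immediate is split as imm4:i:imm3:imm8.

```c
#include <assert.h>
#include <stdint.h>

/* A32: cond=AL, opcode byte 0x30 (movw) or 0x34 (movt), then imm4, Rd, imm12. */
static uint32_t arm_mov16(uint32_t base, unsigned rd, uint16_t imm)
{
    /* Kept conservative: r0..r12 are valid in both instruction sets. */
    assert(rd <= 12u);
    return base | ((uint32_t)(imm >> 12) << 16) | ((uint32_t)rd << 12) | (imm & 0xFFFu);
}

/* Thumb-2: first halfword carries i and imm4, second carries imm3, Rd, imm8. */
static uint32_t thumb2_mov16(uint32_t hw1_base, unsigned rd, uint16_t imm)
{
    uint32_t imm4 = (imm >> 12) & 0xFu;
    uint32_t i    = (imm >> 11) & 1u;
    uint32_t imm3 = (imm >> 8) & 7u;
    uint32_t imm8 = imm & 0xFFu;

    assert(rd <= 12u);
    uint32_t hw1 = hw1_base | (i << 10) | imm4;
    uint32_t hw2 = (imm3 << 12) | ((uint32_t)rd << 8) | imm8;
    return (hw1 << 16) | hw2;   /* written as ".inst.w 0xHHHHLLLL" */
}

uint32_t arm_movw(unsigned rd, uint16_t imm)    { return arm_mov16(0xE3000000u, rd, imm); }
uint32_t arm_movt(unsigned rd, uint16_t imm)    { return arm_mov16(0xE3400000u, rd, imm); }
uint32_t thumb2_movw(unsigned rd, uint16_t imm) { return thumb2_mov16(0xF240u, rd, imm); }
uint32_t thumb2_movt(unsigned rd, uint16_t imm) { return thumb2_mov16(0xF2C0u, rd, imm); }
```

With rd = 3 and imm = 0x1234 these give 0xe3013234, 0xe3413234, 0xf2412334 and 0xf2c12334, the same values used in the two branches of the #ifdef.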
I am comparing the code generated by clang with the code generated by gcc for ARM.
Unfortunately, gcc's code usually has fewer instructions.
I am just curious: is it possible to reduce the size of the code generated by clang?
Maybe I should use some options to do so...
Please consider this very simple example:
> cat test.c
int to_upper(int c)
{
if(c < 'a' || c > 'z') return c;
else return c - ('a' - 'A');
}
> clang -target arm-none-eabi -Oz -c -mthumb -mcpu=cortex-m0 -msoft-float ./test.c -o ./clang_test.o
> /usr/bin/arm-none-eabi-gcc -Os -c -mthumb -mcpu=cortex-m0 -msoft-float ./test.c -o ./gcc_test.o
> /usr/bin/arm-none-eabi-objdump -d ./clang_test.o
./clang_test.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <to_upper>:
0: 4602 mov r2, r0
2: 3a61 subs r2, #97 ; 0x61
4: 4601 mov r1, r0
6: 3920 subs r1, #32
8: 2a19 cmp r2, #25
a: d800 bhi.n e <to_upper+0xe>
c: 4608 mov r0, r1
e: 4770 bx lr
> /usr/bin/arm-none-eabi-objdump -d ./gcc_test.o
./gcc_test.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <to_upper>:
0: 1c03 adds r3, r0, #0
2: 3b61 subs r3, #97 ; 0x61
4: 2b19 cmp r3, #25
6: d800 bhi.n a <to_upper+0xa>
8: 3820 subs r0, #32
a: 4770 bx lr
Why is there so much difference in such simple code?
Can clang generate less code in this case? At least as little as gcc?
Note: if we change the cpu to -mcpu=cortex-a5 (the other options remain the same), then clang and gcc produce
absolutely identical code:
00000000 <to_upper>:
0: f1a0 0361 sub.w r3, r0, #97 ; 0x61
4: 2b19 cmp r3, #25
6: bf98 it ls
8: 3820 subls r0, #32
a: 4770 bx lr
OS: Ubuntu 14.04.3
clang version 3.7.1 (tags/RELEASE_371/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
arm-none-eabi-gcc (4.8.2-14ubuntu1+6) 4.8.2
No, clang cannot generate less code in this case, nor in many others.
Historically, very few code size optimizations have been implemented in LLVM. When optimizing for code size, GCC typically outperforms LLVM significantly.
Here is a presentation that takes a closer look at how GCC and Clang compare in terms of code size optimization.
Presentation video
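For what it's worth, both listings in the question already share the same trick: the two comparisons are folded into one unsigned compare (subs #97, then cmp #25 with bhi). The difference is only in how the result is materialized. A sketch of that range-check idiom in C, with made-up function names:

```c
#include <assert.h>

/* The function as written in the question: two signed comparisons. */
int to_upper_two_compares(int c)
{
    if (c < 'a' || c > 'z') return c;
    else return c - ('a' - 'A');
}

/* What both compilers actually emit: subtract 'a', then one unsigned
   compare against 25. Values below 'a' wrap around to huge numbers. */
int to_upper_one_compare(int c)
{
    unsigned d = (unsigned)c - (unsigned)'a';
    return d <= 25u ? c - ('a' - 'A') : c;
}

/* Checks that the two forms agree over a range of inputs. */
void check_agreement(void)
{
    for (int c = -1000; c <= 1000; c++)
        assert(to_upper_two_compares(c) == to_upper_one_compare(c));
}
```

Writing the source in the second form is not guaranteed to change what either compiler emits; it only shows where the single cmp comes from.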
I've been programming in C and C++ for quite a long time now, so I'm familiar with the linking process as a user: the preprocessor expands all prototypes and macros in each .c file which is then compiled separately into its own object file, and all object files together with static libraries are linked into an executable.
However I'd like to know more about this process: how does the linker link the object files (what do they contain anyway?)? Matching declared but undefined functions with their definitions in other files (how?)? Translating into the exact content of the program memory (context: microcontrollers)?
Application example
Ideally, I'm looking for a detailed step-by-step description of what the process is doing, based on the following simplistic example. Since it doesn't appear to be said anywhere, fame and glory to whoever answers in this way.
main.c
#include "otherfile.h"
int main(void) {
otherfile_print("Foo");
return 0;
}
otherfile.h
void otherfile_print(char const *);
otherfile.c
#include "otherfile.h"
#include <stdio.h>
void otherfile_print(char const *str) {
printf(str);
}
printf is insanely complicated, very bad for a microcontroller hello world example, blinking leds are better but that gets specific to the microcontroller. this will suffice for linking.
two.c
unsigned int glob;
unsigned int two ( unsigned int a, unsigned int b )
{
glob=5;
return(a+b+7);
}
one.c
extern unsigned int glob;
unsigned int two ( unsigned int, unsigned int );
unsigned int one ( void )
{
return(two(5,6)+glob);
}
start.s
.globl _start
_start:
bl one
b .
build everything.
% arm-none-eabi-gcc -O2 -c one.c -o one.o
% arm-none-eabi-gcc -O2 -c two.c -o two.o
% touch start.s
% arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -c one.c -o one.o
% arm-none-eabi-gcc -Wall -O2 -nostdlib -nostartfiles -ffreestanding -c two.c -o two.o
% arm-none-eabi-as start.s -o start.o
% arm-none-eabi-ld -Ttext=0x10000000 start.o one.o two.o -o onetwo.elf
now lets look...
arm-none-eabi-objdump -D start.o
...
00000000 <_start>:
0: ebfffffe bl 0 <one>
4: eafffffe b 4 <_start+0x4>
it is not the compiler's or assembler's job to deal with external references so the branch link to one is left incomplete, they chose to make it a bl of 0 but they could have simply left it totally unencoded, it is up to the authors of the toolchain as to how to communicate between the compiler, assembler, and linker via object files.
Same here
00000000 <one>:
0: e92d4008 push {r3, lr}
4: e3a00005 mov r0, #5
8: e3a01006 mov r1, #6
c: ebfffffe bl 0 <two>
10: e59f300c ldr r3, [pc, #12] ; 24 <one+0x24>
14: e5933000 ldr r3, [r3]
18: e0800003 add r0, r0, r3
1c: e8bd4008 pop {r3, lr}
20: e12fff1e bx lr
24: 00000000 andeq r0, r0, r0
both the function two and the address for the global variable glob are unknown. Note that for the unknown variable the compiler generates code that requires the explicit address of the global so that the linker simply needs to fill in the address, also glob is .data not .text.
00000000 <two>:
0: e59f3010 ldr r3, [pc, #16] ; 18 <two+0x18>
4: e2811007 add r1, r1, #7
8: e3a02005 mov r2, #5
c: e0810000 add r0, r1, r0
10: e5832000 str r2, [r3]
14: e12fff1e bx lr
18: 00000000 andeq r0, r0, r0
here too the global is in .data not here, so the linker will have to place .data and the things in it and then fill in the addresses.
so here we have linked it all together, the gnu linker requires an entry point label defined _start (main is an extern address required by the standard bootstrap, which I am not using so we dont get a main not found error). Because I am not using a linker script the gnu linker places items in the binary in the order they were defined on the command line, as desired i need start first for a microcontroller since I am controlling the boot. I used a non-zero here for demonstration purposes as well...
10000000 <_start>:
10000000: eb000000 bl 10000008 <one>
10000004: eafffffe b 10000004 <_start+0x4>
10000008 <one>:
10000008: e92d4008 push {r3, lr}
1000000c: e3a00005 mov r0, #5
10000010: e3a01006 mov r1, #6
10000014: eb000005 bl 10000030 <two>
10000018: e59f300c ldr r3, [pc, #12] ; 1000002c <one+0x24>
1000001c: e5933000 ldr r3, [r3]
10000020: e0800003 add r0, r0, r3
10000024: e8bd4008 pop {r3, lr}
10000028: e12fff1e bx lr
1000002c: 1000804c andne r8, r0, ip, asr #32
10000030 <two>:
10000030: e59f3010 ldr r3, [pc, #16] ; 10000048 <two+0x18>
10000034: e2811007 add r1, r1, #7
10000038: e3a02005 mov r2, #5
1000003c: e0810000 add r0, r1, r0
10000040: e5832000 str r2, [r3]
10000044: e12fff1e bx lr
10000048: 1000804c andne r8, r0, ip, asr #32
Disassembly of section .bss:
1000804c <__bss_start>:
1000804c: 00000000 andeq r0, r0, r0
so the linker starts to place the first item start.o, it roughly figures out how big that needs to be by just putting what was there. those two instructions. they take 8 bytes so in theory the second item one.o goes next at 0x10000008. That means the encoding for the bl one in start.s can be completed to use the correct relative address (_start + 8 which is the value of the pc when executing so the offset is zero, pc+0 is the encoding)
the linker has roughly placed one.o into the binary it is building and it has to resolve the address to two and the global so it has to place two.o and then figure out where the end of that is to place in this case .bss not .data since I didnt pre-init the variable.
the label for two is at 0x10000030 so it encodes the bl two in one() for that relative offset, it has also placed glob at 1000804c for some reason (I didnt completely define where ram was so the gnu linker will do things like this). Despite the reason, that is where the linker defined the home for that global variable and where the address to glob is needed is filled in by the linker, both one() and two() needed those filled in.
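the pc relative encoding the linker fills in can be reproduced by hand. a minimal sketch, assuming arm (A32) mode where the pc reads as the instruction address plus 8 and the 24 bit field holds a word offset. encode_branch is a name I made up for this, it is not anything from binutils.

```c
#include <assert.h>
#include <stdint.h>

/* Encodes an unconditional A32 b (link = 0) or bl (link = 1) located at
   insn_addr that branches to target. */
uint32_t encode_branch(uint32_t insn_addr, uint32_t target, int link)
{
    /* The pc is two instructions ahead when the branch executes. */
    int32_t delta = (int32_t)(target - (insn_addr + 8u));

    assert((delta & 3) == 0);                          /* word aligned */
    assert(delta >= -(1 << 25) && delta < (1 << 25));  /* within +/-32MB */

    /* Signed word offset, truncated to 24 bits. */
    uint32_t imm24 = ((uint32_t)delta >> 2) & 0x00FFFFFFu;
    return (link ? 0xEB000000u : 0xEA000000u) | imm24;
}
```

encode_branch(0x10000000, 0x10000008, 1) gives eb000000, encode_branch(0x10000014, 0x10000030, 1) gives eb000005 and encode_branch(0x10000004, 0x10000004, 0) gives eafffffe, the same words as in the linked disassembly.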
So the compiler (assembler) and linker have to in the end result in a usable binary, the compiler (assembler) tend to worry about making position independent machine code and leave enough information for the linker so that it has the machine code and a list of unresolved externs that it has to fill in. compilers have improved over time, a simple model would be to have an address location like they did above for the global variables address, where the linker computes the absolute address and just fills it in, clearly above they did not encode the function call in a way that it can use an absolute address to one and two. instead it uses pc relative addressing. This means that the linker has to know the machine code encoding of the bl instruction. the current generation of gnu linker knows quite a bit more and can do some cool things resolving arm to thumb and back, stuff it didnt used to know (you dont need to compile for thumb interwork anymore the linker takes care of it).
So the linker takes binary blobs including data and...links them together into one binary. It first needs to know the actual addresses for the various things in the binary. How you tell the linker this is linker specific and not a global thing for all C/C++ toolchains. Gnu linker scripts are a programming language in and of themselves. These are not necessarily physical nor virtual addresses, it is simply the address space of the code in whatever mode it is in (virtual or physical).
Once the linker knows the addresses, based on linker rules (again linker specific), it starts placing these various binary blobs into those address spaces. then it goes through and resolves the external/global addresses. It was not the case above, but this can be an iterative process. If for example the function two() was at an address in memory that cannot be accessed with a single pc relative instruction (say we put one near zero and two near 0xF0000000) then those that wrote the linker have two choices, the simple choice is to simply state that it cannot encode/implement that far of a branch and bail out and gnu linker did or still does do that.
Or the other solution is the linker fixes the problem. the linker could add a few words of data within the range of the pc relative branch link and those few words of data are a trampoline, for example an absolute address that is loaded into a register then a register based branch, or perhaps, if clever, a pc relative branch if the trampoline is within range (in the case of 0x10000000 to 0xF0000000 that wouldnt work).
If the linker has to add these few words then that may mean that some of the binary blobs have to move to make room for those few words and now all of the addresses in those binary blobs have to move as well. So you have to make another pass across all the binary blobs, resolving all of the new addresses, filling in the answers and for pc relative determining if you can still reach everything. Adding those few words might have made something that was reachable with a pc-relative now unreachable and now that requires a solution (error or patch).
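the reach test the linker has to repeat on every pass is small. a sketch, assuming arm mode bl with its signed 24 bit word offset (roughly +/-32MB around pc+8) and treating addresses as plain numbers without wraparound. branch_reaches is a hypothetical name, not a linker function.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an A32 b/bl at insn_addr can reach target directly,
   0 if the linker would need an error or a trampoline. */
int branch_reaches(uint32_t insn_addr, uint32_t target)
{
    /* A32 instructions are word aligned. */
    assert((insn_addr & 3u) == 0);

    /* 64-bit math so the subtraction cannot overflow. */
    int64_t delta = (int64_t)target - ((int64_t)insn_addr + 8);
    return delta >= -33554432 && delta <= 33554428;
}
```

branch_reaches(0x10000014, 0x10000030) is 1, while branch_reaches(0x10000000, 0xF0000000) is 0, which is the case where the linker has to either bail out or insert the trampoline.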
The assembler itself for a single source file has to go through even more of these gyrations esp for a variable length instruction set like x86 where the addressing is a bit vague. I recommend trying for yourself to make a simple assembler that only supports a few instructions but some of those branches. and parse and encode the instructions and compare that to an existing debugged assembler like gnu assembler.
test.s
ldr r1,locdat
nop
nop
nop
nop
nop
b over
locdat: .word 0x12345678
top:
nop
nop
nop
nop
nop
nop
over:
b top
the right answer is
00000000 <locdat-0x1c>:
0: e59f1014 ldr r1, [pc, #20] ; 1c <locdat>
4: e1a00000 nop ; (mov r0, r0)
8: e1a00000 nop ; (mov r0, r0)
c: e1a00000 nop ; (mov r0, r0)
10: e1a00000 nop ; (mov r0, r0)
14: e1a00000 nop ; (mov r0, r0)
18: ea000006 b 38 <over>
0000001c <locdat>:
1c: 12345678 eorsne r5, r4, #120, 12 ; 0x7800000
00000020 <top>:
20: e1a00000 nop ; (mov r0, r0)
24: e1a00000 nop ; (mov r0, r0)
28: e1a00000 nop ; (mov r0, r0)
2c: e1a00000 nop ; (mov r0, r0)
30: e1a00000 nop ; (mov r0, r0)
34: e1a00000 nop ; (mov r0, r0)
00000038 <over>:
38: eafffff8 b 20 <top>
there are parallels to that activity and the job of a linker. also you could fashion a simple linker based on the above files or something similar, extract the binary blobs and other info and start placing them in whatever address space you want.
Either one is a fairly simple programming task, yet fairly educational. With an existing toolchain that can produce the answer, you can figure out where you are going wrong or how to get at the right answer.
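as a starting point for that exercise, here is a sketch of one instruction a toy assembler would need: the pc relative ldr used to fetch locdat and the addresses of globals. it assumes arm mode (pc reads as instruction address plus 8) and the 12 bit immediate offset form, and encode_ldr_literal is a made up name.

```c
#include <assert.h>
#include <stdint.h>

/* Encodes "ldr rd, [pc, #offset]" for an instruction at insn_addr that
   loads the word stored at literal_addr. */
uint32_t encode_ldr_literal(uint32_t insn_addr, unsigned rd, uint32_t literal_addr)
{
    int32_t delta = (int32_t)(literal_addr - (insn_addr + 8u));

    /* The offset magnitude must fit in 12 bits. */
    assert(rd < 16u && delta > -4096 && delta < 4096);

    uint32_t up = delta >= 0;   /* U bit: 1 = add the offset, 0 = subtract */
    uint32_t imm12 = (uint32_t)(delta >= 0 ? delta : -delta);
    return 0xE51F0000u | (up << 23) | ((uint32_t)rd << 12) | imm12;
}
```

encode_ldr_literal(0x0, 1, 0x1c) gives e59f1014, the first word of the listing, and encode_ldr_literal(0x10, 3, 0x24) gives e59f300c as seen in one.o.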
I was writing a loadable kernel module and when trying to compile it, the linker fails with the following message:
*** Warning: "__floatsidf" [/testing/Something.ko] undefined!
I am NOT using floating point variables, so that's not it. What is the cause of such errors?
Note: I'm using Linux ubuntu kernel v. 3.5.0-23-generic
__floatsidf is a runtime routine to convert a 32-bit signed integer into a double precision floating point number. Somewhere in your project is a line that looks like:
double foo = bar;
Or something similar, where bar is a 32-bit integer. It could also be that you're calling one of the libm functions (or any other function, really) that expects a double with an integer parameter:
foo = pow(bar, baz);
where either bar or baz (or both) is an integer.
Without showing some code, there's not much more we can do to help.
To narrow it down, check the object files your compiler generates (before you link them) and look in the disassembly for a reference to that symbol - that should tell you what function it's happening in.
Here's an example of what I mean. First up - source code:
#include <math.h>
int function(int x, int y)
{
return pow(x, y);
}
Pretty straightforward, right? Now, I'm going to compile it for ARM and disassemble:
$ clang -arch arm -O2 -c -o example.o example.c
$ otool -tV example.o
example.o:
(__TEXT,__text) section
_function:
00000000 e92d40f0 push {r4, r5, r6, r7, lr}
00000004 e28d700c add r7, sp, #12
00000008 e1a04001 mov r4, r1
0000000c ebfffffb bl ___floatsidf
00000010 e1a05000 mov r5, r0
00000014 e1a00004 mov r0, r4
00000018 e1a06001 mov r6, r1
0000001c ebfffff7 bl ___floatsidf
00000020 e1a02000 mov r2, r0
00000024 e1a03001 mov r3, r1
00000028 e1a00005 mov r0, r5
0000002c e1a01006 mov r1, r6
00000030 ebfffff2 bl _pow
00000034 ebfffff1 bl ___fixdfsi
00000038 e8bd40f0 pop {r4, r5, r6, r7, lr}
0000003c e12fff1e bx lr
Look at that - two calls to __floatsidf and one to __fixdfsi, matching the two conversions of x and y to double and then the conversion of the return type back to int.
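If the culprit turns out to be a pow() on integers like the one above, one way out is to keep the whole computation in integers. The following is a userspace sketch for illustration only (kernel code has no assert.h), and both function names are made up:

```c
#include <assert.h>

/* The int -> double conversion is implicit in the return statement; on a
   target without hardware floating point this is the kind of line that
   typically turns into a call to __floatsidf. */
double widen(int x)
{
    return x;
}

/* Integer-only power by repeated squaring: no double anywhere, so no
   conversion routine is needed. Overflow is not handled. */
long ipow(long base, unsigned exp)
{
    long result = 1;

    /* Exponents this large would overflow long for any |base| > 1. */
    assert(exp < 64u);
    while (exp) {
        if (exp & 1u)
            result *= base;
        exp >>= 1;
        if (exp)
            base *= base;
    }
    return result;
}
```

For example, ipow(2, 10) is 1024 and ipow(-2, 3) is -8.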
You are using floating point somewhere - your module includes a conversion from int to double. It might be as simple as calling a function that takes a double parameter and passing an int.
You could try searching your code for "double".
You could try compiling your module to assembly code and looking at that to find which function uses __floatsidf.
Remember that the use of double might be in a header file, possibly one written by someone else.