As an extension of this question: GCC compile and link raw output
I am trying to compile and link a piece of code with a custom __start. As a note, I do NOT require this to work on any known architecture, so compliance with any specification is not important, getting it to work consistently is.
I have a simple piece of assembly (which I got from a URL I can't find now).
.set noreorder /* so we can use delay slots explicitly */
.text
.globl main
.globl __start
.type __start,#function
.ent __start
__start:
jal main;
nop;
li $0,1;
.end __start
If I understand this correctly, all this does is call my main method, do a no-op in the branch-delay slot, then write the number 1 to register 0 (I know this violates the MIPS specification, it is intentional - it denotes completion of the code and is "caught" before it actually occurs).
However, when I use the mips ld to link this with an example piece of code using this command mips-linux-gnu-ld --section-start=.text=0 start.o main.o -o executable
I get some unusual output when viewed with objdump
00000000 <.pic.main>:
0: 3c190000 lui t9,0x0
4: 0800022b j 8ac <main>
8: 273908ac addiu t9,t9,2220
c: 00000000 nop
00000010 <__start>:
10: 0c000000 jal 0 <.pic.main>
14: 00000000 nop
18: 24000001 li zero,1
1c: 00000000 nop
.........
000008ac <main>
.........
No matter how trivial my test program, I always get the same .pic.main function. However, in some cases it appears above __start and in some cases below.
I would like to remove this "function" entirely, but failing that would like it to always appear AFTER the __start.
As a bonus, if anyone knows what this function is or why it occurs, I'd be intrigued.
It looks like a position-independent (PIC) jump stub. The linker doesn't know where your code is going to be placed, so it creates a PIC stub that works in all cases. A relative jump, or a jump through a register, could solve the problem, although it wouldn't be a jump-and-link.
I would try using
-mrelax-pic-calls to turn PIC calls that are normally dispatched via register $t9 into direct calls. This is only possible if the linker can resolve the destination at link-time and if the destination is within range for a direct call.
-mbranch-cost=num to set the cost of branches to roughly num “simple” instructions. "This cost is only a heuristic and is not guaranteed to produce consistent results across releases."
-mno-shared to avoid generating code that is fully position-independent and that can therefore be linked into shared libraries
-mno-embedded-pic
I'd put my money on one of the first two.
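The stub in the question's objdump output can be decoded by hand, which may help confirm what it does. The sketch below is only an illustration: the helper names mips_jump_target and mips_lui_addiu are made up for this example, and it assumes the standard MIPS32 J-type and I-type encodings. Decoding the three words suggests that the stub loads the address of main (0x8ac) into $t9 and then jumps there, which matches the PIC convention that $t9 holds the callee's address on entry.

```c
#include <stdint.h>

/* Hypothetical helper: target of a MIPS J-type instruction (j/jal).
   The low 26 bits are a word index; the top 4 bits come from the
   address of the delay slot (instruction address + 4). */
uint32_t mips_jump_target(uint32_t insn, uint32_t insn_addr)
{
    uint32_t index = insn & 0x03FFFFFFu;
    return ((insn_addr + 4u) & 0xF0000000u) | (index << 2);
}

/* Hypothetical helper: value built by a lui/addiu pair on one register,
   i.e. (upper immediate << 16) + sign-extended lower immediate. */
uint32_t mips_lui_addiu(uint32_t lui_insn, uint32_t addiu_insn)
{
    uint32_t hi = (lui_insn & 0xFFFFu) << 16;
    int32_t lo = (int16_t)(addiu_insn & 0xFFFFu);
    return hi + (uint32_t)lo;
}
```

With the words from the listing, mips_jump_target(0x0800022b, 0x4) and mips_lui_addiu(0x3c190000, 0x273908ac) both evaluate to 0x8ac, the address of main.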
I could only find bits and pieces of information on the symbol _start, which is called from the target startup code in order to establish the C runtime environment. This would be necessary to ensure that all initialized global/static variables are properly loaded prior to branching to main().
In my case, I am using an MCU with an ARM Cortex-R4F core CPU. When the device resets, I implement all of the steps recommended by the MCU manufacturer then attempt to branch to the symbol _start using the following lines of code:
extern void _start(void);
_start();
I am using something similar to the following to link the program:
armeb-eabi-gcc-7.5.0 -marm -fno-exceptions -Og -ffunction-sections -fdata-sections -g -gdwarf-3 -gstrict-dwarf -Wall -mbig-endian -mcpu=cortex-r4 -Wl,-Map,"app_tms570_dev.map" --entry main -static -Wl,--gc-sections -Wl,--build-id=none -specs="nosys.specs" -o[OUTPUT FILE NAME HERE] [ALL OBJECT FILES HERE] -Wl,-T[LINKER COMMAND FILE NAME HERE]
My toolchain in this case is gcc-linaro-7.5.0-2019.12-i686-mingw32_armeb-eabi, which is being used since my MCU device is big-endian.
As I trace through the call to symbol _start, I can see my program branch to symbol _start then a few unexpected things happen.
First, there are a couple of places where the following instruction is called:
EF123456 svc #0x123456
This basically generates a software interrupt, which causes the program to branch to the software interrupt handler that I have configured for the device.
Secondly, the device eventually branches to __libc_init_array then _init. However, symbol _init does not contain any branch instruction and allows the program to flow into _fini, which also does not contain any branch instruction and allows the program to flow into whatever code was placed next in memory. This eventually causes some type of abort exception, as would be expected.
The disassembly associated with _init and _fini:
_init():
00003b00: E1A0C00D mov r12, r13
00003b04: E92DDFF8 push {r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r14, pc}
00003b08: E24CB004 sub r11, r12, #4
_fini():
00003b0c: E1A0C00D mov r12, r13
00003b10: E92DDFF8 push {r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r14, pc}
00003b14: E24CB004 sub r11, r12, #4
Based on some other documentation I read, I also attempted to call main() directly, but this just caused the program to jump to main() without initializing anything. I also tried to call symbol __main() similar to what is done when using the ARM Compiler in order to execute startup code, but this symbol is not found.
Note that this is for a bare-metal-ish system that does not use semihosting.
My question is: Is there a way to set up the system and call a function that will establish the C runtime environment automatically and branch to main() using the GCC linker?
For the time being, I have implemented my own function to initialize .data sections and the .bss sections are already being zeroed at reset using a built in feature of the MCU device.
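For reference, a manual .data/.bss initialization of the kind described above is usually just a word copy and a word fill. The following is a minimal sketch, not the code actually used here: the function names are invented for illustration, and the linker symbols mentioned in the comment (__data_load_start and friends) are hypothetical and would have to match whatever the linker command file defines.

```c
#include <stdint.h>

/* Copy initialized data from its load address (e.g. flash) to its
   run address (RAM), one 32-bit word at a time. */
void copy_data_section(uint32_t *run_start,
                       const uint32_t *load_start,
                       const uint32_t *load_end)
{
    while (load_start < load_end) {
        *run_start++ = *load_start++;
    }
}

/* Clear the zero-initialized region [bss_start, bss_end). */
void zero_bss_section(uint32_t *bss_start, uint32_t *bss_end)
{
    while (bss_start < bss_end) {
        *bss_start++ = 0u;
    }
}

/* In a reset handler this might be called as follows, assuming the
   linker script exports these (hypothetical) symbols:
     copy_data_section(&__data_start, &__data_load_start, &__data_load_end);
     zero_bss_section(&__bss_start, &__bss_end);                         */
```

Both loops assume the sections are word-aligned and a multiple of four bytes long, which the linker script would need to guarantee.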
Adding some more details here:
The specific MCU that I am using should not be relevant, particularly taking the following discussion into consideration.
First, I have already set up the exception vectors for the device in an assembler file:
.section .excvecs,"ax",%progbits
.type Exc_Vects, %object
.size Exc_Vects, .-Exc_Vects
// See DDI0363G, Table 3-6
Exc_Vects:
b c_int00 // Reset vector
b exc_undef // Undefined instruction
b exc_software // Software
b exc_prefetch // Pre-fetch abort
b exc_data // Data abort
b exc_invalid // Invalid vector
There are two instructions that follow for the IRQ and FIQ interrupts as well, but they are set according to the MCU datasheet. I have defined handlers for the undefined instruction, prefetch abort, data abort and invalid vector exceptions. For the software exception I use some assembly to jump to an address that can be changed at runtime. My startup sequence begins at c_int00. These have all been tested and work with no problems.
My reset handler takes care of all of the steps needed for initializing the MCU in accordance with the MCU datasheet. This includes initializing CPU registers and the stack pointers, which are loaded using symbols from the linker file.
The toolchain that I am using, noted above, includes the C standard libraries and other libraries needed to compile and link my program with no problems. This includes the symbol _start that I mentioned previously.
From what I understand, the function _start typically wraps main(). Before it calls main() it initializes .bss and .data sections, configures the heap, as well as performing some other tasks to set up the environment. When main() returns, it performs some clean up tasks and branches to a designated exit() function. (Side note: _start is defined in newlib based on the source code that I downloaded from linaro).
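To make that ordering concrete, here is a rough sketch of the "run initializers, then call main" part of such a wrapper. It is not newlib's actual _start: run_startup and the demo_* functions are hypothetical names for this example, the initializer table stands in for whatever array of constructor pointers the linker collects, and the .data/.bss/heap setup and the exit path are left out.

```c
typedef void (*init_fn)(void);

/* Call every initializer in [init_begin, init_end) in order,
   then hand control to the program entry and return its result. */
int run_startup(const init_fn *init_begin, const init_fn *init_end,
                int (*entry)(void))
{
    for (const init_fn *fn = init_begin; fn < init_end; ++fn) {
        (*fn)();
    }
    return entry();
}

/* Demo only: record the order in which things run. */
char startup_trace[8];
static int trace_len;

static void trace(char c)
{
    if (trace_len < 7) {
        startup_trace[trace_len++] = c;
        startup_trace[trace_len] = '\0';
    }
}

static void demo_ctor_a(void) { trace('a'); }
static void demo_ctor_b(void) { trace('b'); }
static int demo_main(void) { trace('m'); return 0; }

int run_demo_startup(void)
{
    const init_fn table[] = { demo_ctor_a, demo_ctor_b };
    return run_startup(table, table + 2, demo_main);
}
```

After run_demo_startup() returns, startup_trace holds the two initializers followed by the entry function, in that order.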
There is some detail regarding this in a separate response here:
What is the use of _start() in C?
I have been using the ARM Compiler as an alternative for the same project. There, __main performs these functions. For the stack initialization, I basically provide it an empty hook function and for exit I provide it with a function that safely terminates the program should main() return for some reason. I am not sure if something like this is needed for GCC.
I would note that I have included option -specs="nosys.specs" without option -nostartfiles. My understanding is that this avoids implementing some of the functions that I do not want to use in my application, such as I/O operations, but links the startup code.
I am not using the heap in my project as dynamic memory use is frowned upon, but I was hoping to be able to use the startup code primarily in order to avoid having to remember to initialize .data sections manually. Above I noted that my application is baremetal-ish. I am actually using an RTOS and have the memory partitioned into blocks so that I can use the device MPU.
So I am trying to write some code using x86 and I can't seem to get it to move contents of a register to a spot in memory.
The code is just this
global main
SECTION .DATA
var_i: DD 0
SECTION .TEXT
main:
push DWORD 4
pop EAX
mov [var_i], EAX
mov EAX, 0
ret
I am using nasm and gcc on the code.
The problem I am having is that whenever I try to move to the spot in memory it segfaults
What kind of system/object format are you using? I'm guessing you're using ELF on Linux or Unix, as that would explain your problem:
Section names in ELF are case sensitive, and on most ELF-based OSes the special sections .text and .data are understood, but your sections .TEXT and .DATA have no meaning. As a result, they just get stuck into the executable after the other sections and get the same access permissions. If you're just linking the above code, that will be after the .fini section, so it will be executable and read-only. So when you try to write to the variable, you get a segfault.
Change your code to use .data and .text as section names and it should work.
I'm trying to learn assembly by compiling simple functions and looking at the output.
I'm looking at calling functions in other libraries. Here's a toy C function that calls a function defined elsewhere:
void give_me_a_ptr(void*);
void foo() {
give_me_a_ptr("foo");
}
Here's the assembly produced by gcc:
$ gcc -Wall -Wextra -g -O0 -c call_func.c
$ objdump -d call_func.o
call_func.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <foo>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <foo+0xe>
e: 90 nop
f: 5d pop %rbp
10: c3 retq
I was expecting something like call <give_me_a_ptr@plt>. Why is this jumping to a relative position before it even knows where give_me_a_ptr is defined?
I'm also puzzled by mov $0, %edi. This looks like it's passing a null pointer -- surely mov $address_of_string, %rdi would be correct here?
You're not building with symbol-interposition enabled (a side-effect of -fPIC), so the call destination address can potentially be resolved at link time to an address in another object file that is being statically linked into the same executable. (e.g. gcc foo.o bar.o).
However, if the symbol is only found in a library that you're dynamically linking to (gcc foo.o -lbar), the call has to be indirected through the PLT to support that.
Now this is the tricky part: without -fPIC or -fPIE, gcc still emits asm that calls the function directly:
int puts(const char*); // puts exists in libc, so we can link this example
void call_puts(void) { puts("foo"); }
# gcc 5.3 -O3 (without -fPIC)
movl $.LC0, %edi # absolute 32bit addressing: slightly smaller code, because static data is known to be in the low 2GB, in the default "small" code model
jmp puts # tail-call optimization. Same as call puts/ret, except for stack alignment
But if you look at the linked binary:
(on this Godbolt compiler explorer link, click the "binary" button to toggle between gcc -S asm output and objdump -dr disassembly)
# disassembled linker output
mov $0x400654,%edi
jmpq 400490 <puts@plt>
During linking, the call to puts was "magically" replaced with indirection through puts@plt, and a puts@plt definition is present in the linked executable.
I don't know the details of how this works, but it's done at link time when linking to a shared library. Crucially, it doesn't require anything in the header files to mark the function prototype as being in a shared library. You get the same results from including <stdio.h> as you do from declaring puts yourself. (This is highly not recommended; it's probably legal for a C implementation to only work properly with the declarations in headers. It happens to work on Linux, though.)
When compiling a position-independent executable (with -fPIE), the linked binary jumps to puts through the PLT, identically to without -fPIC. However, the compiler asm output is different (try it yourself on the godbolt link above):
call_puts: # compiled with -fPIE
leaq .LC0(%rip), %rdi # RIP-relative addressing for static data
jmp puts@PLT
The compiler forces indirection through the PLT for any calls to functions it can't see the definition for. I don't understand why. In PIE mode, we're compiling code for an executable, not a shared library. The linker should be able to link multiple object files into a position-independent executable with direct calls between functions defined in the executable. I'm testing on Linux (my desktop and godbolt), not OS X, where I assume gcc -fPIE is the default. It might be configured differently, IDK.
With -fPIC instead of -fPIE, things are even worse: even calls to global functions defined within the same compilation unit have to go through the PLT, to support symbol interposition. (e.g. LD_PRELOAD=intercept_some_functions.so ./a.out)
The differences between -fPIC and -fPIE are mainly that PIE can assume no symbol interposition for functions in the same compilation unit, but PIC can't. OS X requires position-independent executables, as well as shared libraries, but there is a difference in what the compiler can do when making code for a library vs. making code for an executable.
This Godbolt example has some more functions that demonstrate stuff about PIC and PIE mode, e.g. that call_puts() can't inline into another function in PIC mode, only PIE.
See also: Shared object in Linux without symbol interposition, -fno-semantic-interposition error.
puzzled by mov $0, %edi
You're looking at disassembly output from the .o, where addresses are just placeholder 0s that will be replaced by the linker at link time, based on the relocation information in the ELF object file. That's why @Leandros suggested objdump -r.
Similarly, the relative displacement in the call machine code is all-zeros, because the linker hasn't filled it in yet.
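As a rough illustration of what "filling it in" means, the sketch below patches a 32-bit PC-relative field the way x86-64 PC-relative relocations (such as R_X86_64_PC32 or R_X86_64_PLT32) are defined: stored value = S + A - P, where S is the symbol address, A the addend (typically -4 for a call, because the displacement is relative to the end of the instruction), and P the address of the field being patched. The function name apply_pc32 and the addresses in the usage note are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Patch the 4-byte little-endian field at section[field_offset] with
   S + A - P, where P = section_addr + field_offset. */
void apply_pc32(uint8_t *section, size_t field_offset,
                uint64_t section_addr, uint64_t symbol_addr,
                int64_t addend)
{
    uint64_t place = section_addr + field_offset;
    uint32_t value = (uint32_t)(symbol_addr + (uint64_t)addend - place);

    for (int i = 0; i < 4; ++i) {
        section[field_offset + i] = (uint8_t)(value >> (8 * i));
    }
}
```

For the call at offset 9 in foo, the displacement field starts at offset 0xa. If .text were placed at the (illustrative) address 0x400000 and the target at 0x400490, apply_pc32(text, 10, 0x400000, 0x400490, -4) would store 0x482, which is the distance from the instruction after the call (0x40000e) to the target.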
I'm still studying this linking process myself, but wanted to restate something in my own words. The PLT entries for 'user' functions might not all be bound to their final addresses by the time execution starts. Doing so could take a lot of time at startup, and not every function reachable through the PLT might even be used. So under a 'lazy binding' scheme, the very first time a 'user' function is called through its PLT entry, control goes to the dynamic linker's 'binding function' first. The binding function looks up the right address for the 'user' function and stores it in the GOT slot that the PLT entry jumps through (as far as I can tell, the PLT code itself is not rewritten; only the GOT slot changes). Thereafter, every call through that PLT entry reaches the 'user' function directly and the binding function is not called again. This might be why the PLT entry looks odd at first blush: until the first call, it leads to the binding function and not to the 'user' function.
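The idea can be mimicked in plain C with a function pointer standing in for the GOT slot. This is only a toy model under that assumption, not how the dynamic linker is implemented; all names (got_slot, lazy_resolver_stub, call_through_plt, real_square) are invented for the example.

```c
typedef int (*fn_t)(int);

/* Counts how often the slow "binding" path runs. */
int resolver_calls;

/* The real 'user' function that callers ultimately want. */
int real_square(int x) { return x * x; }

int lazy_resolver_stub(int x);

/* Stand-in for the GOT slot: it initially points at the resolver. */
fn_t got_slot = lazy_resolver_stub;

/* First call lands here: look up the target, patch the slot,
   then forward the original call. */
int lazy_resolver_stub(int x)
{
    ++resolver_calls;
    got_slot = real_square;
    return got_slot(x);
}

/* Stand-in for the PLT entry: always jumps through the slot. */
int call_through_plt(int x) { return got_slot(x); }
```

The first call_through_plt(3) goes through the resolver and returns 9; later calls go straight to real_square, and resolver_calls stays at 1.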
I'm currently having a weird issue when trying to run a C program that calls a very simple ARM assembly function. Here's my C code:
#include <stdio.h>
#include <stdlib.h>
extern void getNumber(int* pointer);
int main()
{
int* pointer = malloc(sizeof(int));
getNumber(pointer);
printf("%d\n", *pointer);
return 0;
}
And here's my assembly code:
.section .text
.align 4
.arm
.global getNumber
.type getNumber STT_FUNC
getNumber:
mov r1, #0
str r1, [r0]
bx lr
So far so good. However, if I add a line with mov r7, #0 at the top of getNumber, the program segfaults when trying to access pointer. After inspecting it with gdb I noticed now the pointer itself is stored at a very low address, such as 0xa.
Now, I did a bit of research and apparently r7 is the frame pointer for THUMB code (according to this). However, I'm clearly stating I don't want to use THUMB instructions in the .arm line in my assembly code. Why on earth is it failing?
I'm compiling both the .c and .s files using arm-linux-gnueabihf-gcc, and I'm running the program on a Cortex-A8 based board running Arch Linux.
Edit: The program runs fine if I compile using the -fomit-frame-pointer flag. However, I still want to know why is it using r7 as the frame pointer.
Edit 2: It's still failing even if I use .code 32 instead of .arm.
The ARM Procedure Call Standard specifies the following:
A subroutine must preserve the contents of the registers r4-r8, r10, r11 and SP (and r9 in PCS variants that designate r9 as v6).
So your assembly language subroutine must save & restore r7 if it uses it.
You might be avoiding the problem with your small test program by not compiling for Thumb mode, but you're just accidentally avoiding the problem. Anything that links to your assembly routine is entitled to expect that r7 will be preserved.
You're crashing the program because you are corrupting the frame pointer, like you mentioned. There is really no rhyme or reason to the convention. Just that ARM reserves certain registers for certain things. Kinda like in x86 esp is the stack pointer.
Here's a pretty good reference for registers to avoid:
http://msdn.microsoft.com/en-us/library/ms253599(v=vs.80).aspx
I finally got it: doing $ arm-linux-gnueabihf-gcc -v showed me the default options my compiler is using. Among those is: --with-mode=thumb.
Compiling with -marm fixed it. Now it's working as intended!
Edit: Upon reading the comments here I realize I was mistaken. I should've saved/restored r7 so it wouldn't screw up the rest of my program. Good thing I learned this now with a toy project and not while working on something real!
I want to compile a program with gcc with link time optimization for an ARM processor. When I compile without LTO, the system gets compiled. When I enable LTO
(with -flto), I get the following assembler-error:
Error: invalid literal constant: pool needs to be closer
Looking around the web I found out that this has something to do with the constants in my system, which are placed in a special section called .rodata, which is called a constant pool and is placed right after the .text section in my system. It seems that when compiling with LTO because of inlining and other optimizations this .rodata section gets too far away from the instructions, so that the addressing of the constants is not possible anymore. Is it possible to place the constants right after the function that uses them? Or is it possible to use another addressing mode so the .rodata section can still be addressed? Thanks.
This is an assembler message, not a linker message, so this happens before sections are generated.
The assembler has a pseudo instruction for loading constants into registers:
ldr r0, =0x12345678
this is expanded into
ldr r0, constant_12345678 @ PC-relative load from the literal pool
...
bx lr
constant_12345678:
.word 0x12345678
The constant pool usually follows the return instruction. With function inlining, the function can get long enough that the return instruction is too far away; unfortunately, the compiler has no idea of the distance between memory addresses, and the assembler has no idea of control flow other than "flow does not pass beyond the return instruction, so it is safe to emit the constant pool here".
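To put a number on "too far away": in ARM (A32) state, a PC-relative LDR encodes a 12-bit byte offset plus a direction bit, and the PC reads as the instruction's address plus 8, so the literal has to lie within 4095 bytes of that value. The check below is a small sketch under that assumption (Thumb encodings have different limits); the function name is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if an A32 "ldr rX, [pc, #+/-imm12]" located at insn_addr can
   reach a literal stored at literal_addr. */
bool arm_literal_in_range(uint32_t insn_addr, uint32_t literal_addr)
{
    int64_t pc = (int64_t)insn_addr + 8;          /* PC reads as insn + 8 */
    int64_t offset = (int64_t)literal_addr - pc;  /* signed byte distance */
    return offset >= -4095 && offset <= 4095;
}
```

So a function whose body grows past roughly 4 KB between a load and the pool emitted after its return is enough to trigger the error.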
Unfortunately, there is no good solution at the moment.
You could try an asm block containing
b 1f
.ltorg
1:
This will force-emit the constant pool at this point, at the cost of an extra branch instruction.
It may be possible to instruct the assembler to omit the branch if the constant pool is empty, but I cannot test that at the moment, so this is probably not valid:
.if (2f - 1f)
b 2f
.endif
1:
.ltorg
2:
"This is an assembler message, not a linker message, so this happens before sections are generated" - I am not sure, but I think it is a little bit more complicated with LTO. Compiling (including assembling) the individual C files with LTO enabled works fine and does not cause any problems. The problem occurs when I try to link them together with LTO enabled. I don't know exactly how LTO is done, but apparently this also includes calling the assembler again, and then I get this error message. When linking without LTO, everything is fine, and when I look at the disassembly I can see that my constants are not placed after a function. Instead, all constants are placed in the .rodata section. With LTO enabled, because of inlining, my functions probably get too large to reach the constant pool...