ARM + gcc: don't use one big .rodata section

I want to compile a program with gcc with link time optimization for an ARM processor. When I compile without LTO, the system gets compiled. When I enable LTO (with -flto), I get the following assembler error:
Error: invalid literal constant: pool needs to be closer
Looking around the web I found out that this has something to do with the constants in my system, which are placed in a special section called .rodata, which is called a constant pool and is placed right after the .text section in my system. It seems that when compiling with LTO because of inlining and other optimizations this .rodata section gets too far away from the instructions, so that the addressing of the constants is not possible anymore. Is it possible to place the constants right after the function that uses them? Or is it possible to use another addressing mode so the .rodata section can still be addressed? Thanks.

This is an assembler message, not a linker message, so this happens before sections are generated.
The assembler has a pseudo instruction for loading constants into registers:
ldr r0, =0x12345678
this is expanded into
ldr r0, constant_12345678   @ assembled as a PC-relative load: ldr r0, [pc, #offset]
...
bx lr
constant_12345678:
.word 0x12345678
The constant pool usually follows the return instruction. With function inlining, the function can get long enough that the return instruction is too far away; unfortunately, the compiler has no idea of the distance between memory addresses, and the assembler has no idea of control flow other than "flow does not pass beyond the return instruction, so it is safe to emit the constant pool here".
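To put a number on "too far away": in ARM (A32) state, a PC-relative LDR encodes a 12-bit byte offset that is added to or subtracted from the word-aligned PC, and the PC reads as the instruction address plus 8. A minimal sketch of that range check follows; the function name is made up for illustration, and the Thumb encodings have different limits that are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can an A32 "ldr rX, [pc, #imm]" located at insn_addr
   reach a literal stored at literal_addr?
   A32 LDR (literal) has a 12-bit byte offset, added to or subtracted from
   Align(PC, 4), where PC reads as the instruction address + 8. */
static bool a32_literal_reachable(uint32_t insn_addr, uint32_t literal_addr)
{
    int64_t base  = (int64_t)((insn_addr + 8u) & ~UINT32_C(3));
    int64_t delta = (int64_t)literal_addr - base;
    return delta >= -4095 && delta <= 4095;
}
```

So once inlining pushes the end of a function more than roughly 4 KB past a load, a pool emitted after the return can no longer be addressed from that load.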
Unfortunately, there is no good solution at the moment.
You could try an asm block containing
b 1f
.ltorg
1:
This will force-emit the constant pool at this point, at the cost of an extra branch instruction.
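From C, that asm block could be wrapped in a macro and dropped into the middle of a long function. This is only a sketch: FLUSH_LITERAL_POOL and long_function are hypothetical names, the macro expands to nothing on non-ARM targets so the file still builds elsewhere, and it only flushes literals the assembler has pending at that point (for example from ldr rX, =value in earlier inline asm).

```c
/* Hypothetical macro: branch over a literal pool emitted right here. */
#if defined(__GNUC__) && defined(__arm__)
#define FLUSH_LITERAL_POOL() __asm__ volatile ("b 1f\n\t.ltorg\n1:")
#else
#define FLUSH_LITERAL_POOL() ((void)0)   /* no-op on other targets */
#endif

/* Stand-in for a function that grew large after inlining. */
static int long_function(int x)
{
    x = x * 3 + 1;          /* ... first half of the body ... */
    FLUSH_LITERAL_POOL();   /* pending literals are emitted here */
    x = x * 5 + 2;          /* ... second half of the body ... */
    return x;
}
```

Numeric local labels such as 1: may be defined more than once in the same file, so the macro can be used several times, or end up duplicated by inlining, without label clashes.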
It may be possible to instruct the assembler to omit the branch if the constant pool is empty, but I cannot test that at the moment, so this is probably not valid:
.if (2f - 1f)
b 2f
.endif
1:
.ltorg
2:

"This is an assembler message, not a linker message, so this happens before sections are generated" - I am not sure, but I think it is a little more complicated with LTO. Compiling (including assembling) the individual C files with LTO enabled works fine and does not cause any problems. The problem occurs when I try to link them together with LTO enabled. I don't know exactly how LTO is done, but apparently it also involves calling the assembler again, and that is when I get this error message. When linking without LTO, everything is fine, and when I look at the disassembly I can see that my constants are not placed after a function. Instead, all constants are placed in the .rodata section. With LTO enabled, because of inlining, my functions probably get too large to reach the constant pool...

Related

Why do we need a clobbered registers list in inline assembly?

In my guide book it says:
"In inline assembly, Clobbered registers list is used to tell the compiler which registers we are using (So it can empty them before that)."
I don't understand this at all: why does the compiler need to know? What's the problem with leaving those registers as they are? Did they mean instead that it backs them up and restores them after the assembly code?
I hope someone can provide an example, as I spent hours reading about clobbered register lists without finding a clear answer.
The problems you'd see from failing to tell the compiler about registers you modify would be exactly the same as if you wrote a function in asm that modified some call-preserved registers (see footnote 1). See more explanation and a partial example in Why should certain registers be saved? What could go wrong if not?
In GNU inline-asm, all registers are assumed preserved, except for ones the compiler picks for "=r" / "+r" or other output operands. The compiler might be keeping a loop counter in any register, or anything else that it's going to read later and expect it to still have the value it put there before the instructions from the asm template. (With optimization disabled, the compiler won't keep variables in registers across statements, but it will when you use -O1 or higher.)
Same for all memory except for locations that are part of an "=m" or "+m" memory output operand. (Unless you use a "memory" clobber.) See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more details.
Footnote 1:
Unlike for a function, you should not save/restore any registers with your own instructions inside the asm template. Just tell the compiler about it so it can save/restore at the start/end of the whole function after inlining, and avoid having any values it needs in them. In fact, in ABIs with a red-zone (like x86-64 System V) using push/pop inside the asm would be destructive: Using base pointer register in C++ inline asm
The design philosophy of GNU C inline asm is that it uses the same syntax as the compiler internal machine-description files. The standard use-case is for wrapping a single instruction, which is why you need early-clobber declarations if the asm code in the template string doesn't read all its inputs before it writes some registers.
The template is a black box to the compiler; it's up to you to accurately describe it to the optimizing compiler. Any mistake is effectively undefined behaviour, and leaves room for the compiler to mess up other variables in the surrounding code, potentially even in functions that call this one if you modify a call-preserved register that the compiler wasn't otherwise using.
That makes it impossible to verify correctness just by testing. You can't distinguish "correct" from "happens to work with this surrounding code and set of compiler options". This is one reason why you should avoid inline asm unless the benefits outweigh the downsides and risk of bugs. https://gcc.gnu.org/wiki/DontUseInlineAsm
GCC just does a string substitution into the template string, very much like printf, and sends the whole result (including the compiler-generated instructions for the pure C code) to the assembler as a single file. Have a look at https://godbolt.org/ sometime; even if you have invalid instructions in the inline asm, the compiler itself doesn't notice. Only when you actually assemble will there be a problem. ("binary" mode on the compiler-explorer site.)
See also https://stackoverflow.com/tags/inline-assembly/info for more links to guides.
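As a concrete illustration of the points above, here is a small sketch for x86-64 (my choice of target for the example; the function name is hypothetical). The template uses rcx as a scratch register that is not an operand, so it has to appear in the clobber list; "cc" is listed because add modifies the flags. On other targets the function falls back to plain C so the file still compiles.

```c
#include <stdint.h>

/* Hypothetical example: add two numbers through a scratch register. */
static uint64_t add_via_scratch(uint64_t a, uint64_t b)
{
#if defined(__GNUC__) && defined(__x86_64__)
    uint64_t sum;
    __asm__ ("mov %1, %%rcx\n\t"   /* rcx = a  (scratch, not an operand) */
             "add %2, %%rcx\n\t"   /* rcx += b (also modifies flags)     */
             "mov %%rcx, %0"       /* sum = rcx                          */
             : "=r"(sum)
             : "r"(a), "r"(b)
             : "rcx", "cc");       /* tell the compiler what we trashed  */
    return sum;
#else
    return a + b;                  /* portable fallback */
#endif
}
```

Without "rcx" in the clobber list this would probably still appear to work in a tiny test, which is exactly the trap: the compiler would be free to keep a live value in rcx across the asm statement once the function is inlined into larger code. No early-clobber is needed on the output here because it is written only after both inputs have been read.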

Is it possible in practice to compile millions of small functions into a static binary?

I've created a static library with about 2 million small functions, but I'm having trouble linking it to my main function, using GCC (tested 4.8.5 or 7.3.0) under Linux x86_64.
The linker complains about relocation truncations, very much like those in this question.
I've already tried using -mcmodel=large, but as the answer to that same question says, I would "need a crt1.o that can handle full 64-bit addresses". I've then tried compiling one, following this answer, but recent glibc won't compile under -mcmodel=large, even if libgcc does, which accomplishes nothing.
I've also tried adding the flags -fPIC and/or -fPIE to no avail. The best I get is this sole error:
ld: failed to convert GOTPCREL relocation; relink with --no-relax
and adding that flag also doesn't help.
I've searched around the Internet for hours, but most posts are very old and I can't find a way to do this.
I'm aware this is not a common thing to try, but I think it should be possible to do this. I'm working in an HPC environment, so memory or time constraints are not the issue here.
Has anyone been successful in accomplishing something similar with a recent compiler and toolchain?
Either don't use the standard library or patch it. As of version 2.34, Glibc doesn't support the large code model. (See also Glibc mailing list and Redhat Bugzilla)
Explanation
Let's examine the Glibc source code to understand why recompiling with -mcmodel=large accomplished nothing. It replaced the relocations originating from C files. But Glibc contained hardcoded 32-bit relocations in raw Assembly files, such as in start.S (sysdeps/x86_64/start.S).
call *__libc_start_main@GOTPCREL(%rip)
start.S emitted R_X86_64_GOTPCREL for __libc_start_main, which used relative addressing. x86_64 CALL instruction didn't support relative jumps by more than 32-bit displacement, see AMD64 Manual 3. So, ld couldn't offset the relocation R_X86_64_GOTPCREL because the code size surpassed 2GB.
Adding -fPIC didn't help due to the same ISA constraints. For position-independent code, the compiler still generated relative jumps.
Patching
In short, you have to replace 32-bit relocations in the Assembly code. See System V Application Binary Interface AMD64 Architecture Processor Supplement for more info about implementing 64-bit relocations. See also this for a more in-depth explanation of code models.
Why don't 32-bit relocations suffice for the large code model? Because we can't rely on other symbols being in a range of 2GB. All calls must become absolute. Contrast with the small PIC code model, where the compiler generates relative jumps whenever possible.
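The 2GB limit is just the range of a signed 32-bit displacement, measured from the address of the next instruction. A minimal sketch of that check (fits_rel32 is a hypothetical helper name, not something from the toolchain):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: does the distance from the end of the instruction
   to the target fit in the signed 32-bit field of a PC-relative relocation? */
static bool fits_rel32(uint64_t next_insn_addr, uint64_t target_addr)
{
    int64_t disp = (int64_t)(target_addr - next_insn_addr);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

When the linker has to place a symbol (or its GOT entry) further away than that, the displacement cannot be encoded, and you get the truncation / overflow errors mentioned in the question.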
Let's look closely at the R_X86_64_GOTPCREL relocation. It contains the 32-bit difference between RIP and the symbol's GOT entry address. It has a 64-bit substitute — R_X86_64_GOTPCREL64, but I couldn't find a way to use it in Assembly.
So, to replace the GOTPCREL, we have to compute the symbol entry GOT base offset and the GOT address itself. We can calculate the GOT location once in the function prologue because it doesn't change.
First, let's get the GOT base (code lifted wholesale from the ABI Supplement). The _GLOBAL_OFFSET_TABLE_ relocation specifies the offset relative to the current position:
leaq 1f(%rip), %r11
1: movabs $_GLOBAL_OFFSET_TABLE_, %r15
leaq (%r11, %r15), %r15
With the GOT base in the %r15 register, we now have to find the symbol's GOT entry offset. The R_X86_64_GOT64 relocation specifies exactly this. With this, we can rewrite the call to __libc_start_main as:
movabs $__libc_start_main@GOT, %r11
call *(%r11, %r15)
We replaced R_X86_64_GOTPCREL with _GLOBAL_OFFSET_TABLE_ and R_X86_64_GOT64. Replace others in the same vein.
N.B.: Replace R_X86_64_GOT64 with R_X86_64_PLTOFF64 for functions from dynamically linked executables.
Testing
Verify the patch correctness using the following test that requires the large code model. It doesn't contain a million small functions, having one huge function and one small function instead.
Your compiler must support the large code model. If you use GCC, you'll need to build it from the source with the flag -mcmodel=large. Startup files shouldn't contain 32-bit relocations.
The foo function takes more than 2GB, rendering 32-bit relocations unusable. Thus, the test will fail with the overflow error if compiled without -mcmodel=large. Also, add flags -O0 -fPIC -static, link with gold.
extern int foo();
extern int bar();

int foo(){
    bar();
    /* Call sys_exit(0), then pad the function with 4 GiB of zeros. */
    asm( "mov $0x3c, %%rax \n"
         "xor %%rdi, %%rdi \n"
         "syscall \n"
         ".zero 1 << 32 \n"
         /* rax and rdi are written above; syscall itself clobbers rcx and r11 */
         : : : "rax", "rdi", "rcx", "r11");
    return 0;
}

int bar(){
    return 0;
}

int __libc_start_main(){
    foo();
    return 0;
}

int main(){
    return 0;
}
N.B. I used patched Glibc startup files without the standard library itself, so I had to define both __libc_start_main and main.

Compiling PowerPC binary with gcc and restrict useable registers

I have a PowerPC device running a software and I'd like to modify this software by inserting some own code parts.
I can easily write my own assembler code, put it somewhere in an unused region in RAM, replace any instruction in the "official" code by b 0x80001234 where 0x80001234 is the RAM address where my own code extension is loaded.
However, when I compile a C code with powerpc-eabi-gcc, gcc assumes it compiles a complete program and not only "code parts" to be inserted into a running program.
This leads to a problem: The main program uses some of the CPUs registers to store data, and when I just copy my extension into it, it will mess with the previous contents.
For example, if the main program I want to insert code into uses register 5 and register 8 in that code block, the program will crash if my own code writes to r5 or r8. Then I need to convert the compiled binary back to assembler code, edit the appropriate registers to use registers other than r5 and r8 and then compile that ASM source again.
What I'm now searching for is an option to the ppc-gcc which tells it "never ever use the PPC registers r5 and r8 while creating the bytecode".
Is this possible or do I need to continue crawling through the ASM code on my own replacing all the "used" registers with other registers?
You should think of another approach to solve this problem.
There is a gcc extension to reserve a register as a global variable:
register int *foo asm ("r12");
Please note that if you use this extension, your program no longer conforms to the ABI of the operating system you are working on. This means that you cannot call any library functions without risking crashes or overwritten variables.

Embedded: memcpy/memset not used by most CRT startup code ― why?

Context:
I'm working on an ARM target, more specifically a Cortex-M4F microcontroller from ST. When working on such platforms (microcontrollers in general), there's obviously no OS; in order to get a working C/C++ "environment" (moreover, to be standard compliant in regard to initialization of variables) there must be some kind of startup code run at reset that does the minimum setup required before explicitly calling main. Such startup code, as I hinted, must initialize initialized global and static variables (such as int foo = 42; at global scope) and zero out the other globals (such as int bar; at global scope). Then, if necessary, global "ctors" are called.
On a microcontroller, that simply means that the startup code has to copy data from flash to ram for every initialized global (all in section '.data') and clear the others (all in '.bss'). Because I use GCC, I must supply such a startup code and I happily analyzed several startup codes (and its associated linker script!) bundled with numerous examples I've found on the Internet, all using the same demo board I'm developing on.
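For reference, the hand-written loops in question typically amount to something like the sketch below, written here in C rather than assembly. The function names are mine; on a real target they would be called with symbols exported by the linker script (often named along the lines of _sidata, _sdata, _edata, _sbss, _ebss, but that naming is an assumption and varies from script to script). Taking plain pointers keeps the sketch runnable on a host.

```c
#include <stdint.h>

/* Copy the initial values of .data from its load address (flash)
   to its run address (RAM), one word at a time. */
static void copy_data(uint32_t *dst, const uint32_t *dst_end,
                      const uint32_t *src)
{
    while (dst < dst_end)
        *dst++ = *src++;
}

/* Clear .bss, one word at a time. */
static void zero_bss(uint32_t *dst, const uint32_t *dst_end)
{
    while (dst < dst_end)
        *dst++ = 0u;
}
```

One caveat when such loops are written in C: at higher optimization levels GCC may recognize them and emit calls to memcpy/memset anyway, unless that is disabled (for example with -fno-tree-loop-distribute-patterns or -ffreestanding).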
Question:
As stated, I've seen numerous startup codes, and they initialize globals in different ways, some more efficient in term of space and time than others. But they all have something odd in common: they didn't use memset nor memcpy, resorting instead to hand-written loops to do the job. As it appears natural to me to use standard functions when possible (simple "DRY principle"), I tried the following in lieu of the initial hand-written loops:
/* Initialize .data section */
ldr r0, DATA_LOAD
ldr r1, DATA_START
ldr r2, DATA_SIZE
bl memcpy /* memcpy(DATA_LOAD, DATA_START, DATA_SIZE); */
/* Initialize .bss section */
ldr r0, BSS_START
mov r1, #0
ldr r2, BSS_SIZE
bl memset /* memset(BSS_START, 0, BSS_SIZE); */
... and it worked perfectly. The space savings are negligible, but it is clearly dead simple now.
So, I thought about it, and I see no reason to do hand-written loops in this case:
memcpy and memset are very likely to be linked in the executable anyway, because the programmer would use it directly, or indirectly through another library;
It is smaller;
Speed is not a very important factor for startup code, but nevertheless it is likely faster;
It's nearly impossible to get it wrong.
Any idea why one wouldn't rely on memcpy and memset for startup code?
I suspect the startup code does not want to make assumptions about the implementation of memcpy and such in libc. For example, the implementation of memcpy might use a global variable set by libc initialization code to report which cpu extensions are available, in order to provide optimized SIMD copying on machines that support such operations. At the point where the early "crt" startup code is running, the storage for such a global might be completely uninitialized (containing random junk), in which case it would be dangerous to call memcpy. Even if making the call works for you, it's a consequence of the implementation (or maybe even the unpredictable results of UB...) making it work; this is probably not something the crt code wants to depend on.
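To make that scenario concrete, here is a purely hypothetical sketch (none of these names come from a real libc) of a memcpy that dispatches through a global which only becomes valid after some library init routine has run:

```c
#include <stddef.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Plain byte-by-byte copy, standing in for a "generic" implementation. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Lives in .bss: it is only guaranteed to be NULL after the startup code
   has zeroed .bss, and only valid after the init routine below has run. */
static copy_fn selected_copy;

/* Hypothetical library init; a real one might pick a SIMD variant here. */
static void hypothetical_libc_init(void)
{
    selected_copy = copy_bytes;
}

/* Called before init on bare metal, this jumps through whatever junk
   happens to be in RAM at the address of selected_copy. */
static void *dispatching_memcpy(void *dst, const void *src, size_t n)
{
    return selected_copy(dst, src, n);
}
```

Calling dispatching_memcpy from the reset handler to initialize .data would be using the function before the very memory it depends on has been set up.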
Whether the standard library is linked at all is a decision for the application developer (-nostdlib may be used, for example), but the start-up code is required, so it cannot make any assumptions.
Further, the purpose of the start-up code is to establish an environment in which C code can run; before that is complete, it is by no means a given that any library code that might reasonably assume a complete run-time environment will run correctly. For the functions in question this is perhaps not an issue in many cases, but you cannot know that.
The start-up code has to at least establish a stack and initialise static data, in C++ it additionally calls the constructors of global static objects. The standard library might reasonably assume those are established, so using the standard library before then may conceivably result in erroneous behaviour.
Finally you should be clear that the C language and the C standard library are distinct entities. The language must necessarily be capable of standing alone.
I don't think this is likely to have anything to do with "assumptions about the internal state of memcpy/memset", they are unlikely to use any global resources (though I suppose some odd cases exist where they do).
All start up code on microcontrollers is usually written "inline assembler" in this manner, simply because it runs at an early stage in the code, where a stack might not yet be present and the MMU setup may not yet have been executed. Init code therefore doesn't want to risk putting anything on the stack, simple as that. Function calls put things on the stack.
So while this happened to be the initialization code of the static storage copy-down, you are likely to find the same inline assembler in other such init code as well. For example you will likely find some fundamental register setup code written in assembler somewhere before the copy-down, and you will also find the MMU setup in assembler somewhere around there too.

Where is declaration for get_pc() in GNU ARM?

I'm building legacy code using the GNUARM C compiler and trying to resolve all the implicit declarations of functions.
I've come across some ARM specific functions and can't find the header file containing the declarations for these functions:
get_pc
get_cpsr
get_sp
I have searched the web and only came up with source code containing these functions without any non-standard include files.
I'll also settle for the function declarations.
Since I will also be porting the code to the Cygwin / Windows platform, what are the equivalent declarations for Cygwin GNU GCC?
Thanks.
Just write your own if you really need those functions, asm is easier than inline asm:
.globl get_pc
get_pc:
mov r0,pc
bx lr
.globl get_sp
get_sp:
mov r0,sp
bx lr
.globl get_cpsr
get_cpsr:
mrs r0,cpsr
bx lr
At least for ARM. If you are porting to x86 and need the equivalents, I have to wonder what the code needs with those things anyway. For the cpsr in particular, you would likely have to change any code that uses the result, as status registers pretty much never match across processor vendors/families. The x86 equivalents should still be about the same level of effort: it takes longer to do a Google search and read the results than to just write the code (if you know the processor).
Depending on what your application is doing it is probably better to just comment out any code that calls those functions and/or uses the return value. I can imagine a few reasons why those items would be used, but it could get into architecture specific stuff and that is more involved than just porting a few register read functions. So what user786653 asked is the key question. How are these functions used? Not where can I find them but how are they used and why do you think you need them.
Are you sure those are functions? I'm not very familiar with ARM, but those sound like compiler intrinsics to me. If you're moving to GCC, you might be better off replacing those with inline assembly.
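If you go the inline assembly route, a sketch for GCC could look like the following. The return type is my guess, since the original declarations are unknown; the ARM branch assumes a core that actually has a CPSR (classic ARM7/ARM9 style); and on other GCC targets only get_sp is provided, approximated by the current frame address via __builtin_frame_address, which is not the exact stack pointer.

```c
#include <stdint.h>

#if defined(__GNUC__) && defined(__arm__)
/* ARM: read the registers directly. */
static inline uintptr_t get_sp(void)
{
    uintptr_t v;
    __asm__ volatile ("mov %0, sp" : "=r"(v));
    return v;
}

static inline uintptr_t get_pc(void)
{
    uintptr_t v;
    __asm__ volatile ("mov %0, pc" : "=r"(v));
    return v;
}

static inline uintptr_t get_cpsr(void)
{
    uintptr_t v;
    __asm__ volatile ("mrs %0, cpsr" : "=r"(v));
    return v;
}
#elif defined(__GNUC__)
/* Other GCC targets (e.g. Cygwin on x86): there is no CPSR, and this is
   only an approximation of the stack pointer (the frame address). */
static inline uintptr_t get_sp(void)
{
    return (uintptr_t)__builtin_frame_address(0);
}
#endif
```

Whether an approximation like this is good enough depends entirely on what the legacy code does with the value, which is the question raised in the other answers.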
