I am trying to cross-compile a device driver built for the x86 architecture to an ARM platform. It compiled without any errors, but I don't think all the features are available. So I checked the makefile and found this particular part.
ifeq ($(ARCH),x86_64)
EXTRA_CFLAGS += -mcmodel=kernel -mno-red-zone
endif
This seems to be the only part that depends on the architecture. After some time on Google, I found that -mcmodel=kernel selects the kernel code model and -mno-red-zone avoids using the red zone in memory, and that both of them are x86_64-specific. But it's not clear to me: what impact does setting the code model to kernel actually have?
(Any insight into the problem with arm is also greatly appreciated.)
The x86 Options section of the GCC manual says:
-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space.
(i.e. the upper 2GiB, addresses like 0xfffffffff0001234)
In the kernel code model, static symbol addresses don't fit in 32-bit zero-extended constants (unlike the default small code model where mov eax, imm32 (5 bytes) is the most efficient way to put a symbol address in a register).
But they do fit in sign-extended 32-bit constants, unlike the large code model for example. So mov rax, sign_extended_imm32 (7 bytes) works, and is the same size but maybe slightly more efficient than lea rax, [rel symbol].
But more importantly mov eax, [table + rdi*4] works, because disp32 displacements are sign-extended to 64 bits. -mcmodel=kernel tells gcc it can do this but not mov eax, table.
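As a rough illustration (a minimal sketch; the exact asm depends on compiler version and options), consider indexing a static table:
static int table[1024];

int lookup(unsigned i) {
    /* With the default small code model the address of table fits in 32 bits,
       so GCC can emit something like  mov eax, [table + rdi*4].
       With -mcmodel=kernel the same addressing mode still works, because the
       disp32 is sign-extended into the high 2 GiB; only  mov eax, table
       (a 32-bit zero-extended immediate) can no longer hold the address. */
    return table[i];
}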
RIP-relative addressing can also reach any symbol from any code address (with a rel32 ±2GiB offset), so -fPIC or -fPIE will also make your code work, at the minor expense of not taking advantage of 32-bit absolute addressing in cases where it's useful (e.g. indexing static arrays).
If you didn't get link errors without -mcmodel=kernel (like these), you probably have a gcc that makes PIE executables by default (common on recent distros), so it avoids absolute addressing.
Consider this C code:
void foo(void);
long bar(long x) {
    foo();
    return x;
}
When I compile it on GCC 9.3 with either -O3 or -Os, I get this:
bar:
    push r12
    mov  r12, rdi
    call foo
    mov  rax, r12
    pop  r12
    ret
The output from clang is identical except for choosing rbx instead of r12 as the callee-saved register.
However, I want/expect to see assembly that looks more like this:
bar:
    push rdi
    call foo
    pop  rax
    ret
Since you have to push something to the stack anyway, it seems shorter, simpler, and probably faster to just push your value there, instead of pushing some arbitrary callee-saved register's value there and then storing your value in that register. Ditto for the inverse after call foo when you're putting things back.
Is my assembly wrong? Is it somehow less efficient than messing with an extra register? If the answer to both of those are "no", then why don't either GCC or clang do it this way?
Godbolt link.
Edit: Here's a less trivial example, to show it happens even if the variable is meaningfully used:
long foo(long);
long bar(long x) {
    return foo(x * x) - x;
}
I get this:
bar:
    push rbx
    mov  rbx, rdi
    imul rdi, rdi
    call foo
    sub  rax, rbx
    pop  rbx
    ret
I'd rather have this:
bar:
    push rdi
    imul rdi, rdi
    call foo
    pop  rdi
    sub  rax, rdi
    ret
This time, it's only one instruction off vs. two, but the core concept is the same.
Godbolt link.
TL:DR:
Compiler internals are probably not set up to look for this optimization easily, and it's probably only useful around small functions, not inside large functions between calls.
Inlining to create large functions is a better solution most of the time
There can be a latency vs. throughput tradeoff if foo happens not to save/restore RBX.
Compilers are complex pieces of machinery. They're not "smart" like a human, and expensive algorithms to find every possible optimization are often not worth the cost in extra compile time.
I reported this as GCC bug 69986 - smaller code possible with -Os by using push/pop to spill/reload back in 2016; there's been no activity or replies from GCC devs. :/
Slightly related: GCC bug 70408 - reusing the same call-preserved register would give smaller code in some cases - compiler devs told me it would take a huge amount of work for GCC to be able to do that optimization because it requires picking order of evaluation of two foo(int) calls based on what would make the target asm simpler.
If foo doesn't save/restore rbx itself, there's a tradeoff between throughput (instruction count) vs. an extra store/reload latency on the x -> retval dependency chain.
Compilers usually favour latency over throughput, e.g. using 2x LEA instead of imul reg, reg, 10 (3-cycle latency, 1/clock throughput), because most code averages significantly less than 4 uops / clock on typical 4-wide pipelines like Skylake. (More instructions/uops do take more space in the ROB, reducing how far ahead the same out-of-order window can see, though, and execution is actually bursty with stalls probably accounting for some of the less-than-4 uops/clock average.)
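For example (a small sketch; the exact choice depends on -mtune and compiler version), a multiply by 10 is typically split into cheaper operations:
long times10(long x) {
    /* GCC/clang commonly emit something like
           lea rax, [rdi + rdi*4]    ; x*5
           add rax, rax              ; *2
       rather than  imul rax, rdi, 10 : one extra uop, but lower latency. */
    return x * 10;
}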
If foo does push/pop RBX, then there's not much to gain for latency. Having the restore happen just before the ret instead of just after is probably not relevant, unless there's a ret mispredict or I-cache miss that delays fetching code at the return address.
Most non-trivial functions will save/restore RBX, so it's often not a good assumption that leaving a variable in RBX will actually mean it truly stayed in a register across the call. (Although randomizing which call-preserved registers functions choose might be a good idea to mitigate this sometimes.)
So yes push rdi / pop rax would be more efficient in this case, and this is probably a missed optimization for tiny non-leaf functions, depending on what foo does and the balance between extra store/reload latency for x vs. more instructions to save/restore the caller's rbx.
It is possible for stack-unwind metadata to represent the changes to RSP here, just like if it had used sub rsp, 8 to spill/reload x into a stack slot. (But compilers don't know this optimization either, of using push to reserve space and initialize a variable: What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? And doing that for more than one local var would lead to larger .eh_frame stack unwind metadata, because you're moving the stack pointer separately with each push. That doesn't stop compilers from using push/pop to save/restore call-preserved regs, though.)
IDK if it would be worth teaching compilers to look for this optimization
It's maybe a good idea around a whole function, not across one call inside a function. And as I said, it's based on the pessimistic assumption that foo will save/restore RBX anyway. (Or optimizing for throughput if you know that latency from x to return value isn't important. But compilers don't know that and usually optimize for latency).
If you start making that pessimistic assumption in lots of code (like around single function calls inside functions), you'll start getting more cases where RBX isn't saved/restored and you could have taken advantage.
You also don't want this extra save/restore push/pop in a loop; just save/restore RBX outside the loop and use call-preserved registers in loops that make function calls. Even without loops, most functions make multiple function calls in the general case. This optimization idea could apply if you really don't use x between any of the calls, only before the first and after the last; otherwise you have the problem of maintaining 16-byte stack alignment for each call if you do a pop after one call before the next.
Compilers are not great at tiny functions in general. But it's not great for CPUs either. Non-inline function calls have an impact on optimization at the best of times, unless compilers can see the internals of the callee and make more assumptions than usual. A non-inline function call is an implicit memory barrier: a caller has to assume that a function might read or write any globally-accessible data, so all such vars have to be in sync with the C abstract machine. (Escape analysis allows keeping locals in registers across calls if their address hasn't escaped the function.) Also, the compiler has to assume that the call-clobbered registers are all clobbered. This sucks for floating point in x86-64 System V, which has no call-preserved XMM registers.
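As a small illustration of those forced assumptions (a sketch; the names are made up):
int g;                     /* globally reachable: must be up to date in memory across the call */
void ext(void);            /* opaque external call: might read or write g */

int demo(int local) {      /* local's address never escapes, so it can live in a register */
    g = local;             /* must actually be stored before the call: ext() might read it */
    ext();
    return g + local;      /* g must be reloaded: ext() might have changed it; local need not be */
}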
Tiny functions like bar() are better off inlining into their callers. Compile with -flto so this can happen even across file boundaries in most cases. (Function pointers and shared-library boundaries can defeat this.)
I think one reason compilers haven't bothered to try to do these optimizations is that it would require a whole bunch of different code in the compiler internals, different from the normal stack vs. register-allocation code that knows how to save call-preserved registers and use them.
i.e. it would be a lot of work to implement, and a lot of code to maintain, and if it gets over-enthusiastic about doing this it could make worse code.
And also that it's (hopefully) not significant; if it matters, you should be inlining bar into its caller, or inlining foo into bar. This is fine unless there are a lot of different bar-like functions and foo is large, and for some reason they can't inline into their callers.
Why do compilers insist on using a callee-saved register here?
Because compilers follow the calling conventions defined by the ABI your compiler targets, which is why most compilers generate nearly the same code for a given function.
You could define your own, different calling conventions (e.g. passing even more function arguments in processor registers, or on the contrary "packing" two short arguments into a single processor register with bitwise operations, etc.) and implement your compiler to follow them. You would probably need to recode some of the C standard library (e.g. patch the lower parts of GNU libc and recompile it, if on Linux).
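As a small existing example of bending the convention (a sketch; this is a 32-bit x86 GCC extension, not something the default x86-64 ABI needs), GCC already lets you change argument passing per function:
/* regparm(3) passes the first three integer arguments in EAX, EDX and ECX
   instead of on the stack; caller and callee must both see this attribute. */
int __attribute__((regparm(3))) add3(int a, int b, int c) {
    return a + b + c;
}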
IIRC, some calling conventions are different on Windows and on FreeBSD and on Linux for the same CPU.
Notice that with a recent GCC (e.g. GCC 10 at the start of 2021) you could compile and link with gcc -O3 -flto -fwhole-program and in some cases get some inline expansion. You can also build GCC from its source code as a cross-compiler, and since GCC is free software, you can improve it to follow your private new calling conventions. Be sure to document your calling conventions first.
If performance matters to you a lot, you can consider writing your own GCC plugin doing even more optimizations. Your compiler plugin could even implement other calling conventions (e.g. using asmjit).
Consider also improving TinyCC or Clang or NWCC to fit your needs.
My opinion is that in many cases it is not worth spending months of your effort to improve performance by just a few nanoseconds. But your employer/manager/client could disagree. Consider also compiling (or refactoring) significant parts of your software to silicon, e.g. through VHDL, or using specialized hardware, e.g. GPGPU with OpenCL or CUDA.
When I add -funwind-tables when cross-compiling, I can successfully unwind the backtrace through the libgcc interfaces (_Unwind_Backtrace and _Unwind_VRS_Get).
But when I also add the -O2 option when cross-compiling, the unwind backtrace fails. I used -Q -O2 --help=optimizers to print out the enabled optimizations and tested them, but the results differ from plain -O2, which is very strange.
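For reference, the backtrace is collected roughly like this (a minimal sketch using the libgcc interface named above; it only makes progress when unwind tables or a frame pointer are available):
#include <stdio.h>
#include <unwind.h>

static _Unwind_Reason_Code trace_cb(struct _Unwind_Context *ctx, void *arg)
{
    int *depth = arg;
    /* Print the program counter of each frame the unwinder finds. */
    printf("#%d pc=%#lx\n", (*depth)++, (unsigned long)_Unwind_GetIP(ctx));
    return _URC_NO_REASON;
}

void dump_backtrace(void)
{
    int depth = 0;
    _Unwind_Backtrace(trace_cb, &depth);
}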
You haven't told us which ARM architecture you are building for - but assuming it's a 32-bit architecture, enabling -O2 has also enabled -fomit-frame-pointer (see the documentation for -fomit-frame-pointer).
The frame-pointer
The frame pointer contains the base of the current function's stack frame (and with the knowledge that the caller's frame pointer is stored on the stack, a linked list of all stack frames in the call-tree). It's usually described as fp in documentation - a synonym for r11.
Omitting the frame-pointer
The ARM register file is small at 16 registers - one of which is the program counter.
The frame pointer is one of the 15 that remain, and it is used only for debugging and diagnostics - specifically to provide stack walk-backs and symbolic debugging.
-fomit-frame-pointer tells the compiler not to maintain a frame pointer, thus liberating r11 for other uses and potentially avoiding spills of variables from registers to the stack. It also saves 4 bytes of stack storage per stack frame, plus a store and a load to the stack.
Naturally, if fp is used as a general-purpose register, its contents are undefined and walk-backs won't work.
You probably want to reenable the frame pointer with -fno-omit-frame-pointer for your own sanity.
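For instance (illustrative only; the exact code depends on your GCC version and target), compiling a trivial non-leaf function with and without the flag shows the difference:
/* Compile with an ARM cross compiler, e.g.:
     arm-none-eabi-gcc -O2 -S test.c
     arm-none-eabi-gcc -O2 -fno-omit-frame-pointer -S test.c
   With the frame pointer kept, the prologue looks roughly like
       push {fp, lr}
       add  fp, sp, #4
   and the unwinder can walk the saved fp chain; without it, r11/fp is just
   another register for the allocator. */
int helper(int);

int sum3(int a, int b, int c) {
    return helper(a + b) + c;
}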
On ARMv7, which is Thumb capable, is it right that we can avoid all the veneers by using the BX instruction?
Since this instruction takes a 32 bit register, are we good?
If yes, when I see veneers in the generated code, I should specialize the output for my machine, right?
Thanks
Yes, since BX takes a 32-bit register, there's no need for veneers because you can cover the whole address space.
Of course you'd need to load a 32-bit value into the register, which usually means constant pooling, so if you are looking to squeeze every cycle out of it and your program is not too large, you're better off with relative branches. As @Notlikethat notes, if you don't already have the address in a register there's no point in using BX when you can just LDR PC, ... (unless you need to support ARMv4T interworking).
Relative, non-conditional, 32-bit Thumb branches have a 24-bit offset range, so you can reach +/- 16MB (for others see here). If you're doing ELF, be really careful with 16-bit relative Thumb branches. A 32-bit branch will generate a 24-bit relocation and the linker will insert a veneer if the target can't be addressed with 24 bits. A 16-bit branch generates an 11-bit relocation, and ELF for ARM specifies that the linker is not required to generate veneers for those, so you'd risk a link-time out-of-range branch.
In the famous paper "Smashing the Stack for Fun and Profit", its author takes a C function
void function(int a, int b, int c) {
    char buffer1[5];
    char buffer2[10];
}
and generates the corresponding assembly code output
pushl %ebp
movl %esp,%ebp
subl $20,%esp
The author explains that since computers address memory in multiples of the word size, the compiler reserved 20 bytes on the stack: 8 bytes for buffer1 (5 rounded up to a multiple of 4) and 12 bytes for buffer2 (10 rounded up).
I tried to recreate this example and got the following
pushl %ebp
movl %esp, %ebp
subl $16, %esp
A different result! I tried various combinations of sizes for buffer1 and buffer2, and it seems that modern gcc does not pad buffer sizes to multiples of the word size anymore. Instead it abides by the -mpreferred-stack-boundary option.
As an illustration -- using the paper's arithmetic rules, for buffer1[5] and buffer2[13] I'd get 8+16 = 24 bytes reserved on the stack. But in reality I got 32 bytes.
The paper is quite old and a lot of stuff happened since. I'd like to know, what exactly motivated this change of behavior? Is it the move towards 64bit machines? Or something else?
Edit
The code is compiled on a x86_64 machine using gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) like that:
$ gcc -S -o example1.s example1.c -fno-stack-protector -m32
What has changed is SSE, which requires 16-byte alignment; this is covered in this older gcc document for -mpreferred-stack-boundary=num, which says (emphasis mine):
On Pentium and PentiumPro, double and long double values should be aligned to an 8 byte boundary (see -malign-double) or suffer significant run time performance penalties. On Pentium III, the Streaming SIMD Extension (SSE) data type __m128 suffers similar penalties if it is not 16 byte aligned.
This is also backed up by the paper Smashing The Modern Stack For Fun And Profit, which covers this and other modern changes that break Smashing the Stack for Fun and Profit.
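You can see the 16-byte default at work by lowering the boundary yourself (a sketch; the exact numbers vary by GCC version):
/* Same function as in the question, compiled 32-bit, e.g.:
     gcc -S -m32 -fno-stack-protector example1.c
         -> subl $16, %esp   (stack kept 16-byte aligned for SSE)
     gcc -S -m32 -fno-stack-protector -mpreferred-stack-boundary=2 example1.c
         -> a smaller, 4-byte-aligned adjustment, much closer to the paper's
            per-buffer rounding arithmetic. */
void function(int a, int b, int c) {
    char buffer1[5];
    char buffer2[10];
    (void)buffer1;      /* keep -Wall quiet about unused locals */
    (void)buffer2;
}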
Memory alignment, of which stack alignment is just one aspect, depends on the architecture. It is partly defined in the Application Binary Interface of the language and the Procedure Call Standard for the architecture (sometimes both are in a single spec; it may even vary by platform), and it also depends on the compiler/toolchain where those documents leave room for variation.
The former two documents (names may vary) mostly cover the external interface between functions; they might leave internal structure to the toolchain. However, that has to match the architecture. Normally the hardware requires a minimum alignment but allows a larger alignment for performance reasons (e.g. byte alignment is the minimum, but reading a 32-bit word would then require multiple bus cycles, so the compiler uses 32-bit alignment).
Normally, the compiler (following the PCS) uses an alignment optimal for the architecture and under control of optimization settings (optimize for speed or size). It takes into account not only the size of the object (aligned to its natural boundary), but also sizes of internal busses (e.g. a 32 bit x86 has internal 64 or 128 bit busses, ARM CPUs have internal 32 to 128 (possibly even wider) bit busses), caches, etc. For local variables, it may also take into account access-patterns, so two adjacent variables may be loaded in parallel into a register pair instead of two separate loads or even reorder such variables.
The stack pointer might require a higher alignment, for instance so the CPU can push two registers at once in an interrupt frame, or push vector registers which require higher alignment, etc. You could write quite a thick book about this subject (and I bet someone already has).
So, in general, there is no single one-alignment-fits-all rule. However, for struct and array packing, the C standard does define some rules for packing/alignment, mostly to guarantee consistency of e.g. sizeof(type) and the addresses within an array (required for a correct malloc()).
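For instance (a minimal sketch; the concrete numbers are typical for common 32- and 64-bit x86/ARM ABIs), the padding rules keep each member on its natural boundary and keep arrays of the struct aligned:
#include <stdio.h>

struct s {
    char c;     /* 1 byte, then typically 3 bytes of padding */
    int  i;     /* needs 4-byte alignment on most ABIs */
};

int main(void) {
    /* Typically prints "8 4": sizeof is rounded up to a multiple of the
       strictest member alignment, so element 1 of an array is aligned too. */
    printf("%zu %zu\n", sizeof(struct s), _Alignof(struct s));
    return 0;
}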
Even char arrays might be aligned for optimal cache layout. Note it is not only the CPU which might have caches, but also PCIe bridges, not to mention PCIe transfers themselves down to DRAM pages.
I have not tried that specific compiler version or the distribution version you report. My guess would be that the 16 comes from the stack alignment requirement (i.e. all stack adjustments are x-byte aligned, and x is 16 for your invocation).
Note that the variable alignment you seem to have started with is slightly different from the above and is controlled by alignment attributes on the variable in gcc. Try using those and you should see a difference.
Recently I've become interested in writing my own really really basic OS.
I wrote (well, copied) some basic Assembly that establishes a stack and does some basic things and this seemed to work fine, however attempting to introduce C into the mix has screwed everything up.
I have two main project files: loader.s which is some NASM that creates the stack and calls my C function, and kernel.c which contains the basic C function.
My issue at the moment is essentially that QEMU freezes up when I run my kernel.bin file. I'm guessing there are any number of things wrong with my code -- perhaps this question isn't really appropriate for a StackOverflow format due to its extreme specificity. My project files are as follows:
loader.s:
BITS 16                 ; 16-bit real-mode code
extern kmain            ; Our 'proper' kernel function in C
loader:
    mov ax, 07C0h       ; The segment of our load address [07C0h * 16 = 7C00h]
    add ax, 32          ; Skip 32 paragraphs (32 * 16 = 200h bytes) past our 512-byte boot sector
    mov ss, ax          ; Set 'stack segment' to the start of our stack
    mov sp, 4096        ; Set the stack pointer to the end of our stack [4096 bytes in size]
    mov ax, 07C0h       ; Use 'ax' to set 'ds'
    mov ds, ax          ; Set data segment to where we're loaded
    mov es, ax          ; Set our extra segment
    call kmain          ; Call the kernel proper
    cli                 ; Disable interrupts
    jmp $               ; Hang
; Since putting these in and booting the image without '-kernel' can't find
; a bootable device, we'll comment these out for now and run the ROM with
; the '-kernel' flag in QEMU
;times 510-($-$$) db 0  ; Pad the remainder of our boot sector with 0s
;dw 0xAA55              ; The standard 'magic word' boot signature
kernel.c:
#include <stdint.h>

void kmain(void)
{
    unsigned char *vidmem = (unsigned char *)0xB8000; // Text-mode video memory address
    vidmem[0] = 65;   // The character 'A'
    vidmem[1] = 0x07; // Light grey (7) on black (0)
}
I compile everything like so:
nasm -f elf -o loader.o loader.s
i386-elf-gcc -I/usr/include -o kernel.o -c kernel.c -Wall -nostdlib -fno-builtin -nostartfiles -nodefaultlibs
i386-elf-ld -T linker.ld -o kernel.bin loader.o kernel.o
And then test like so:
qemu-system-x86_64 -kernel kernel.bin
Hopefully someone can have a look over this for me -- the code snippets aren't massively long.
Thanks.
Gosh, where to begin? (rhughes, is that you?)
The code from loader.s goes into the Master Boot Record (MBR). The MBR, however, also holds the partition table of the hard drive. So, once you assembled the loader.s, you have to merge it with the MBR: The code from loader.s, the partition table from the MBR. If you just copy the loader.s code into the MBR, you killed your hard drive's partitioning. To properly do the merge, you have to know where exactly the partition table is located in the MBR...
The output from loader.s, which goes into the MBR, is called a "first stage bootloader". Due to the things described above, you only have 436 bytes in that first stage. One thing you cannot do at this point is slapping some C compiler output on top of that (i.e. making your binary larger than one sector, the MBR) and copying that to the hard drive. While it might work temporarily on an old hard drive, modern ones carry yet more partitioning information in sector 1 onward, which would be destroyed by your copying.
The idea is that you compile kernel.c into a separate binary, the "second stage". The first stage, in the 436 bytes available, then uses the BIOS (or EFI) to load the second stage from a specific point on the hard drive (because you won't be able to add partition table and file system parsing to the first stage), then jump to that just-loaded code. Since the second stage isn't under the same kind of size limitation, it can then go ahead to do the proper thing, i.e. parse the partitioning information, find the "home" partition, parse its file system, then load and parse the actual kernel binary.
I hope you are aware that I am looking at all this from low-earth orbit. Bootloading is one heck of an involved process, and no-one can hope to detail it in one SO posting. Hence, there are websites dedicated to these subjects, like OSDev. But be warned: This kind of development takes experienced programmers, people capable of doing professional-grade research, asking questions the smart way, and carrying their own weight. Since these skills are on a general decline these days, OS development websites have a tendency for grumpy reactions if you approach it wrongly.(*)
(*): Or they toss uncommented source at you, like dwalter did just as I finished this post. ;-)
Edit: Of course, none of this is the actual reason why the emulator freezes. i386-elf-gcc is a compiler generating code for 32-bit protected mode, assuming a "flat" memory model, i.e. code / data segments beginning at zero. Your loader.s is 16-bit real mode code (as stated by the BITS 16 part), which does not activate protected mode, and does not initialize the segment registers to the values expected by GCC, and then proceeds to jump to the code generated by GCC under false assumptions... BAM.