Recently I've become interested in writing my own really really basic OS.
I wrote (well, copied) some basic Assembly that establishes a stack and does some basic things and this seemed to work fine, however attempting to introduce C into the mix has screwed everything up.
I have two main project files: loader.s which is some NASM that creates the stack and calls my C function, and kernel.c which contains the basic C function.
My issue at the moment is essentially that QEMU freezes up when I run my kernel.bin file. I'm guessing there are any number of things wrong with my code -- perhaps this question isn't really appropriate for a StackOverflow format due to its extreme specificity. My project files are as follows:
loader.s:
BITS 16 ; 16 Bits
extern kmain ; Our 'proper' kernel function in C
loader:
mov ax, 07C0h ; Move the starting address [7C00h] into 'ax'
add ax, 32 ; Leave 32 16 byte blocks [200h] for the 512 code segment
mov ss, ax ; Set 'stack segment' to the start of our stack
mov sp, 4096 ; Set the stack pointer to the end of our stack [4096 bytes in size]
mov ax, 07C0h ; Use 'ax' to set 'ds'
mov ds, ax ; Set data segment to where we're loaded
mov es, ax ; Set our extra segment
call kmain ; Call the kernel proper
cli ; Clear ints
jmp $ ; Hang
; Since putting these in and booting the image without '-kernel' can't find
; a bootable device, we'll comment these out for now and run the ROM with
; the '-kernel' flag in QEMU
;times 510-($-$$) db 0 ; Pad the remainder of our boot sector with 0s
;dw 0xAA55 ; The standard 'magic word' boot sig
kernel.c:
#include <stdint.h>
void kmain(void)
{
unsigned char *vidmem = (unsigned char*)0xB8000; //Video memory address (cast matches the pointer type)
vidmem[0] = 65; //The character 'A'
vidmem[1] = 0x07; //Light grey (7) on black (0)
}
I compile everything like so:
nasm -f elf -o loader.o loader.s
i386-elf-gcc -I/usr/include -o kernel.o -c kernel.c -Wall -nostdlib -fno-builtin -nostartfiles -nodefaultlibs
i386-elf-ld -T linker.ld -o kernel.bin loader.o kernel.o
And then test like so:
qemu-system-x86_64 -kernel kernel.bin
Hopefully someone can have a look over this for me -- the code snippets aren't massively long.
Thanks.
Gosh, where to begin? (rhughes, is that you?)
The code from loader.s goes into the Master Boot Record (MBR). The MBR, however, also holds the partition table of the hard drive. So, once you have assembled loader.s, you have to merge it with the existing MBR: the code comes from loader.s, the partition table from the MBR. If you just copy the assembled loader.s over the MBR, you have killed your hard drive's partitioning. To do the merge properly, you have to know exactly where the partition table is located in the MBR...
The output from loader.s, which goes into the MBR, is called a "first stage bootloader". Due to the things described above, you only have 436 bytes in that first stage. One thing you cannot do at this point is slap some C compiler output on top of that (i.e. make your binary larger than one sector, the MBR) and copy that to the hard drive. While it might work temporarily on an old hard drive, modern ones carry yet more partitioning information in sector 1 onward, which would be destroyed by your copying.
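To make the merge concrete, here is a minimal sketch in C. It assumes the classic MBR layout (partition table at offset 446, i.e. 0x1BE, and the 0x55 0xAA signature at offset 510) and uses the conservative 436-byte code limit mentioned above. The name merge_boot_code is made up for illustration; this is not a tool you should point at a real disk without checking your own layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    SECTOR_SIZE       = 512,
    BOOT_CODE_MAX     = 436, /* conservative code size used in the text above */
    PART_TABLE_OFFSET = 446, /* 0x1BE: four 16-byte partition entries */
    SIGNATURE_OFFSET  = 510  /* 0x1FE: boot signature bytes 0x55, 0xAA */
};

/* Copy boot code into an existing MBR image (hypothetical helper).
 * Bytes from BOOT_CODE_MAX up to the signature, which include the
 * partition table at PART_TABLE_OFFSET, are left untouched.
 * Returns 0 on success, -1 if the code does not fit. */
int merge_boot_code(uint8_t mbr[SECTOR_SIZE], const uint8_t *code, size_t code_len)
{
    assert(mbr != NULL && code != NULL);
    if (code_len > BOOT_CODE_MAX)
        return -1;
    memset(mbr, 0, BOOT_CODE_MAX);      /* clear the old boot code */
    memcpy(mbr, code, code_len);        /* drop in the new first stage */
    mbr[SIGNATURE_OFFSET]     = 0x55;   /* make sure the sector is bootable */
    mbr[SIGNATURE_OFFSET + 1] = 0xAA;
    return 0;
}
```

You would read sector 0 into a 512-byte buffer, call merge_boot_code on it with the flat binary produced from loader.s, and write the buffer back.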
The idea is that you compile kernel.c into a separate binary, the "second stage". The first stage, in the 436 bytes available, then uses the BIOS (or EFI) to load the second stage from a specific point on the hard drive (because you won't be able to add partition table and file system parsing to the first stage), then jump to that just-loaded code. Since the second stage isn't under the same kind of size limitation, it can then go ahead to do the proper thing, i.e. parse the partitioning information, find the "home" partition, parse its file system, then load and parse the actual kernel binary.
I hope you are aware that I am looking at all this from low-earth orbit. Bootloading is one heck of an involved process, and no-one can hope to detail it in one SO posting. Hence, there are websites dedicated to these subjects, like OSDev. But be warned: This kind of development takes experienced programmers, people capable of doing professional-grade research, asking questions the smart way, and carrying their own weight. Since these skills are on a general decline these days, OS development websites have a tendency for grumpy reactions if you approach it wrongly.(*)
(*): Or they toss uncommented source at you, like dwalter did just as I finished this post. ;-)
Edit: Of course, none of this is the actual reason why the emulator freezes. i386-elf-gcc is a compiler generating code for 32-bit protected mode, assuming a "flat" memory model, i.e. code / data segments beginning at zero. Your loader.s is 16-bit real mode code (as stated by the BITS 16 part), which does not activate protected mode, and does not initialize the segment registers to the values expected by GCC, and then proceeds to jump to the code generated by GCC under false assumptions... BAM.
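To illustrate the mismatch: in real mode a physical address is formed as segment * 16 + offset, while GCC's output assumes a plain 32-bit linear address with segment bases of zero. A small sketch of the real-mode arithmetic (real_mode_phys is just an illustrative name):

```c
#include <stdint.h>

/* Real mode: physical address = segment * 16 + offset.
 * (Wrap-around at 1 MiB is ignored in this sketch.) */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With the values from loader.s, 07C0h:0000h is physical 7C00h, and the top of the stack (segment 07C0h + 32 = 07E0h, offset 4096) is physical 8E00h. Video memory at B8000h is B800h:0000h, and it is out of reach of any 16-bit offset from DS = 07C0h, since B8000h - 7C00h = B0400h does not fit in 16 bits.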
Why can you find jmp esp only in big applications?
In this little program you can't find jmp esp. But why?
This is the source code:
#include <stdio.h>
int main(int argc, char **argv)
{
char buffer[64];
printf("Type in something: ");
gets(buffer);
return 0;
}
AT&T jmp *%esp / Intel jmp esp has machine code ff e4. You should be looking for that byte sequence at any offset.
(I assembled a .s with that instruction and used objdump -d to get the machine code.)
There is a lot of discussion in comments from people who thought you were talking about jmp *(%esp) as a ret without pop. For future readers, see Why JMP ESP instead of directly jumping into the stack on security.SE for more about this ret2reg technique to defeat stack ASLR when trying to return to your executable payload. (But not defeating non-executable stacks, so this is rarely useful on its own in modern systems.) It's a special case of a ROP gadget.
Compilers are never going to use that instruction intentionally, so you'll only ever find it as part of the bytes for another instruction, or in a non-code section. Or not at all if no data happens to include it.
Also, your search method could miss it if it did occur.
objdump | grep 'jmp.*esp' is not good here. That will miss ff e4 as part of mov eax, 0x1234e4ff for example. And disassembly of data sections similarly will only "check" bytes where objdump decides that an instruction starts. (It doesn't do overlapping disassembly starting from every possible byte address; it gets to the end of one instruction and assumes the next instruction starts there.)
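A byte-level search avoids that problem. Here is a minimal sketch that checks every offset of a buffer for ff e4; find_jmp_esp is a made-up name, and you would feed it the raw bytes of the file or of a mapped section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the first ff e4 byte pair at or after `start`,
 * or -1 if there is none. Every byte offset is checked, regardless of
 * where a disassembler would think instructions begin. */
long find_jmp_esp(const uint8_t *buf, size_t len, size_t start)
{
    assert(buf != NULL || len == 0);
    for (size_t i = start; i + 1 < len; i++) {
        if (buf[i] == 0xFF && buf[i + 1] == 0xE4)
            return (long)i;
    }
    return -1;
}
```

For example, mov eax, 0x1234e4ff assembles to b8 ff e4 34 12, so the search reports a hit at offset 1 of that instruction even though a disassembly listing shows no jmp there.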
But even so, I compiled your code with GCC 8.2 with optimization disabled (gcc -m32 foo.c) and searched for e4 bytes in the output of hexdump -C. None of them were preceded by an ff byte. (I tried again with gcc -m32 -no-pie -fno-pie foo.c, still no ff e4.)
There's no reason to expect that to appear in a tiny executable.
You could introduce one with a global const unsigned char jmp_esp[] = { 0xff, 0xe4 };
But note that modern toolchains (like late 2018 / 2019) put even the .rodata section in a non-executable segment. So you'd need to compile with -z execstack for byte sequences in non-code sections to be useful as gadgets.
You probably need -z execstack anyway, or something else that makes the stack itself executable, so that your payload is in an executable page and not just the jmp esp in a const array.
If you disabled library ASLR, then you could use an ff e4 at a known address somewhere in libc. But with normal randomization of library mapping addresses, it's probably just as easy to try to guess the stack address of your buffer directly, +- some bytes you fill with a NOP slide. (Unless you can get the program you're attacking to leak a library address, defeating ASLR).
I want to write an x86 program that multiplies corresponding elements of 2 arrays (array1[0]*array2[0] and so on till 5 elements) and stores the results in a third array. I don't even know where to start. Any help is greatly appreciated.
The first thing you'll want is an assembler. I'm personally a big fan of NASM: in my opinion it has a very clean and concise syntax. It's also what I started on, so that's what I'll use for this answer.
Other than NASM you have:
GAS
This is the GNU assembler. Unlike NASM, there are versions for many architectures, so if you switch architectures the directives and way of working stay about the same and only the instructions change. GAS does however have the unfortunate downside of being somewhat unfriendly for people who want to use the Intel syntax.
FASM
This is the Flat Assembler, it is an assembler written in Assembly. Like NASM it's unfriendly to people who want to use AT&T syntax. It has a few rough edges but some people seem to prefer it for DOS applications (especially because there's a DOS port of it) and bare metal work.
Now you might be reading 'AT&T syntax' and 'Intel syntax' and wondering what's meant by that. These are dialects of x86 assembly. They both assemble to the same machine code but reflect slightly different ways of thinking about each instruction. AT&T syntax tends to be more verbose whereas Intel syntax tends to be more minimal; however, certain parts of AT&T syntax have nicer operand orderings than Intel syntax. A good demonstration of the difference is the mov instruction:
AT&T syntax:
movl (0x10), %eax
This means get the long value (1 dword, aka 4 bytes) at memory address 0x10 and put it in the register eax. Take note of the fact that:
The mov is suffixed with the operand length.
The memory address is surrounded by parentheses (you can think of them like a pointer dereference in C)
The register is prefixed with %
The instruction moves the left operand into the right operand
Intel Syntax:
mov eax, [0x10]
Take note of the fact that:
We do not need to suffix the instruction with the operand size; the assembler infers it. In situations where it can't, we specify the size next to the address.
The register is not prefixed
Square brackets are used to address memory
The second operand is moved into the first operand
I will be using Intel syntax for this answer.
Once you've installed NASM on your machine you'll want a simple build script (when you start writing bigger programs use a Makefile or some other proper build system, but for now this will do):
nasm -f elf arrays.asm
ld -o arrays arrays.o -melf_i386
rm arrays.o
echo
echo " Done building, the file 'arrays' is your executable"
Remember to chmod +x the script or you won't be able to execute it.
Now for the code along with some comments explaining what everything means:
global _start ; The linker will be looking for this entrypoint, so we need to make it public
section .data ; We're going on to describe our data here
array_length equ 5 ; This is effectively a macro and isn't actually being stored in memory
array1 dd 1,4,1,5,9 ; dd means declare dwords
array2 dd 2,6,5,3,5
sys_exit equ 1
section .bss ; Data that isn't initialised with any particular value
array3 resd 5 ; Leave us 5 dword sized spaces
section .text
_start:
xor ecx,ecx ; index = 0 to start
; In a Linux static executable, registers are initialized to 0 so you could leave this out if you're never going to link this as a dynamic executable.
_multiply_loop:
mov eax, [array1+ecx*4] ; move the value at the given memory address into eax
; We calculate the address we need by first taking ecx (which tells us which
; item we want) multiplying it by 4 (i.e: 4 bytes/1 dword) and then adding it
; to our array's start address to determine the address of the given item
imul eax, dword [array2+ecx*4] ; This performs a 32-bit integer multiply
mov dword [array3+ecx*4], eax ; Move our result to array3
inc ecx ; Increment ecx
; While ecx is a general purpose register the convention is to use it for
; counting hence the 'c'
cmp ecx, array_length ; Compare the value in ecx with our array_length
jb _multiply_loop ; Loop again while ecx is still below array_length
; If the loop has concluded the instruction pointer will continue
_exit:
mov eax, sys_exit ; The system call we want
; ebx is already equal to 0, ebx contains the exit status
mov ebp, esp ; sysenter expects ebp to hold the user stack pointer
sysenter ; Call the Linux kernel and tell it that our program has concluded
If you wanted the full 64-bit result of the 32-bit multiply, use one-operand mul. But normally you only want a result that's the same width as the inputs, in which case imul is most efficient and easiest to use. See links in the x86 tag wiki for docs and tutorials.
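In C terms the difference looks roughly like this. It is only a sketch: the function names are made up, and unsigned types are used so that the truncating version has well-defined wrap-around (the low 32 bits of a product are the same for signed and unsigned inputs).

```c
#include <stdint.h>

/* Same-width result: keeps only the low 32 bits of the product,
 * like the two-operand imul used in the loop. */
uint32_t mul_low32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* Full-width result: the whole 64-bit product, which one-operand mul
 * would leave split across edx:eax. */
uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}
```

For small inputs like the ones in this program the two agree; they only differ once the product no longer fits in 32 bits.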
You'll notice that this program has no output. I'm not going to cover writing the algorithm to print numbers because we'd be here all day, that's an exercise for the reader (or see this Q&A)
However, in the meantime we can run our program in gdbtui and inspect the data. Use your build script to build, then open your program with the command gdbtui arrays (or gdb -tui arrays if your GDB does not ship the gdbtui wrapper). You'll want to enter these commands:
layout asm
break _exit
run
print (int[5])array3
And GDB will display the results.
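For reference, here is the same loop written in C, as a sketch you can use to check the assembly against (multiply_arrays is just an illustrative name):

```c
#include <stddef.h>
#include <stdint.h>

/* out[i] = a[i] * b[i] for each of the n elements, mirroring the
 * mov / imul / mov sequence in _multiply_loop. */
void multiply_arrays(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```

With array1 = 1,4,1,5,9 and array2 = 2,6,5,3,5 from the listing, the products are 2, 24, 5, 15 and 45, so that is what array3 should contain at the breakpoint.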
I am trying to cross-compile a device driver built for the x86 architecture to an ARM platform. It compiled without any errors, but I don't think all of its features are available. So I checked the makefile and found this particular part.
ifeq ($(ARCH),x86_64)
EXTRA_CFLAGS += -mcmodel=kernel -mno-red-zone
This seems to be the only part that depends on the architecture. After some time on Google, I found that -mcmodel=kernel selects the kernel code model and -mno-red-zone avoids using the red zone in memory, and that both are for x86_64. But it's not clear to me what impact setting cmodel to kernel has.
(Any insight into the problem with ARM is also greatly appreciated.)
The x86 Options section of the GCC manual says:
-mcmodel=kernel
Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space.
(i.e. the upper 2GiB, addresses like 0xfffffffff0001234)
In the kernel code model, static symbol addresses don't fit in 32-bit zero-extended constants (unlike the default small code model where mov eax, imm32 (5 bytes) is the most efficient way to put a symbol address in a register).
But they do fit in sign-extended 32-bit constants, unlike the large code model for example. So mov rax, sign_extended_imm32 (7 bytes) works, and is the same size but maybe slightly more efficient than lea rax, [rel symbol].
But more importantly mov eax, [table + rdi*4] works, because disp32 displacements are sign-extended to 64 bits. -mcmodel=kernel tells gcc it can do this but not mov eax, table.
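The "fits in a sign-extended 32-bit constant" condition is easy to check numerically. A small sketch (the function names are made up for illustration):

```c
#include <stdint.h>

/* True if a 64-bit address can be rebuilt by zero-extending its low 32 bits. */
int fits_zero_extended32(uint64_t addr)
{
    return addr <= 0xFFFFFFFFull;
}

/* True if a 64-bit address can be rebuilt by sign-extending its low 32 bits,
 * i.e. it lies in the low 2 GiB or in the top ("negative") 2 GiB. */
int fits_sign_extended32(uint64_t addr)
{
    return addr <= 0x7FFFFFFFull || addr >= 0xFFFFFFFF80000000ull;
}
```

An address like 0xfffffffff0001234 fails the first test but passes the second, which is exactly the situation the kernel code model is built around.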
RIP-relative addressing can also reach any symbol from any code address (with a rel32 +-2GiB offset), so -fPIC or -fPIE will also make your code work, at the minor expense of not taking advantage of 32-bit absolute addressing in cases where it's useful. (e.g. indexing static arrays).
If you didn't get link errors without -mcmodel=kernel (like these), you probably have a gcc that makes PIE executables by default (common on recent distros), so it avoids absolute addressing.
In x86, GCC generates the following instructions when it wants to call __stack_chk_fail:
; start of the basic block
00000757 call sub_590 ; __stack_chk_fail#plt
0000075c add byte [ds:eax], al
0000075e add byte [ds:eax], al
; start point of another function
Similar behavior happens in ARM:
; start of the basic block
00001000 bl __stack_chk_fail#PLT
00001004 dd 0x0000309c ; data entry, NOT executable indeed!
In static analysis tools, when one wants to build a CFG, the CFG algorithm can't determine the last instruction of the basic block in which __stack_chk_fail is called.
It would seem reasonable to have some sort of return instruction after the call to __stack_chk_fail, to prevent the CPU from executing instructions (or potentially data entries) which it shouldn't.
In these cases, the CFG generator algorithm assumes it's a regular function call and continues traversing into another function's code (in the former example) or into data entries (in the latter one), which is totally unwanted.
So, my question is: why doesn't GCC insert a return (or branch) instruction at the end of the basic block?
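One thing worth knowing here: glibc declares __stack_chk_fail with the noreturn attribute, so the compiler treats the call as the end of the block and emits nothing after it. A CFG builder can mirror that by keeping a list of callees that never return. A rough sketch follows; the list contents and the name call_ends_block are assumptions for illustration, and a real tool would also need to strip PLT decorations from the symbol name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical list: callees a CFG builder may treat as never returning. */
static const char *const noreturn_callees[] = {
    "__stack_chk_fail", "abort", "exit", "_exit",
};

/* Return 1 if a call to `name` should terminate the basic block with no
 * fall-through edge, 0 if it is an ordinary call. */
int call_ends_block(const char *name)
{
    size_t n = sizeof noreturn_callees / sizeof noreturn_callees[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(name, noreturn_callees[i]) == 0)
            return 1;
    }
    return 0;
}
```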
I am trying to write a kernel, mostly for entertainment purposes, and I am running into a problem where I believe it is triple faulting. Everything worked before I attempted to enable paging. The code that is breaking is this:
void switch_page_directory(page_directory_t *dir){
current_directory = dir;
asm volatile("mov %0, %%cr3":: "r"(&dir->tablesPhysical));
u32int cr0;
asm volatile("mov %%cr0, %0": "=r"(cr0));
cr0 |= 0x80000000;//enable paging
asm volatile("mov %0, %%cr0":: "r"(cr0)); //this line breaks
}//switch page directory
I have been following a variety of tutorials / documents for this, but the one I am using for paging is http://www.jamesmolloy.co.uk/tutorial_html/6.-Paging.html . I am not sure what other code will be useful in figuring this out, but if there is more I should provide I will be more than happy to do so.
Edit=====
I believe CS, DS and SS are selecting the correct entries. Here's the code used to set them:
global gdt_flush
extern gp
gdt_flush:
lgdt [gp] ; Load the GDT with our 'gp' which is a special pointer
mov ax, 0x10 ; 0x10 is the offset in the GDT to our data segment
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov ss, ax
jmp 0x08:flush2 ; 0x08 is the offset to our code segment: Far jump!
flush2:
ret ; Returns back to the C code!
and here's the gdt struct itself
struct gdt_entry{
unsigned short limit_low;
unsigned short base_low;
unsigned char base_middle;
unsigned char access;
unsigned char granularity;
unsigned char base_high;
} __attribute__((packed));
struct gdt_ptr{
unsigned short limit;
unsigned int base;
} __attribute__((packed));
struct gdt_entry gdt[5];
struct gdt_ptr gp;
The IDT is very similar to this.
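Since the contents of the entries matter as much as the struct layout, here is a sketch of how a flat entry is usually packed into that 8-byte format. gdt_encode and gdt_selector are illustrative names, not code from the question, and the access / flag values in the usage note are the common flat ring 0 ones rather than anything confirmed by the post.

```c
#include <stdint.h>

struct gdt_entry {
    uint16_t limit_low;
    uint16_t base_low;
    uint8_t  base_middle;
    uint8_t  access;
    uint8_t  granularity;   /* high nibble: flags, low nibble: limit 19:16 */
    uint8_t  base_high;
} __attribute__((packed));

/* Split a 32-bit base and 20-bit limit across the descriptor fields. */
void gdt_encode(struct gdt_entry *e, uint32_t base, uint32_t limit,
                uint8_t access, uint8_t flags)
{
    e->base_low    = base & 0xFFFF;
    e->base_middle = (base >> 16) & 0xFF;
    e->base_high   = (base >> 24) & 0xFF;
    e->limit_low   = limit & 0xFFFF;
    e->granularity = ((limit >> 16) & 0x0F) | (flags & 0xF0);
    e->access      = access;
}

/* A ring 0 selector is just the entry index times 8. */
uint16_t gdt_selector(unsigned index)
{
    return (uint16_t)(index * 8);
}
```

A flat ring 0 code segment would be gdt_encode(&gdt[1], 0, 0xFFFFF, 0x9A, 0xC0) and the matching data segment gdt_encode(&gdt[2], 0, 0xFFFFF, 0x92, 0xC0), which is where the 0x08 and 0x10 in gdt_flush come from.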
GDT: you don't say what the contents of the GDT entries are, but the stuff you've shown looks fairly similar to the earlier part of the tutorial you linked to and if you've set up the entries in the same way, then all should be well (i.e. flat segment mapping with a ring 0 code segment for CS, ring 0 data segment for everything else, both with a base of 0 and a limit of 4GB).
IDT: probably doesn't matter anyway if interrupts are disabled and you're not (yet) expecting to cause page faults.
Page tables: incorrect page tables do seem like the most likely suspect. Make sure that your identity mapping covers all of the code, data and stack memory that you're using (at least).
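As a point of comparison, a minimal identity mapping of the first 4 MiB looks something like the sketch below. It assumes 32-bit non-PAE paging with 4 KiB pages; the function names are made up, and table_phys stands for the physical address of the page table, which you have to supply yourself.

```c
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SIZE = 4096, ENTRIES = 1024, PTE_PRESENT = 0x1, PTE_WRITABLE = 0x2 };

/* Fill one page table so that virtual [0, 4 MiB) maps to the same
 * physical addresses: entry i points at frame i. */
void identity_map_first_4mib(uint32_t table[ENTRIES])
{
    for (uint32_t i = 0; i < ENTRIES; i++)
        table[i] = (i * PAGE_SIZE) | PTE_PRESENT | PTE_WRITABLE;
}

/* Point directory slot 0 at that table; every other slot is not present. */
void install_table(uint32_t dir[ENTRIES], uint32_t table_phys)
{
    for (uint32_t i = 0; i < ENTRIES; i++)
        dir[i] = 0;
    dir[0] = (table_phys & 0xFFFFF000u) | PTE_PRESENT | PTE_WRITABLE;
}
```

If your kernel's code, data or stack live above 4 MiB, one table is not enough and the instruction after the write to cr0 will fault.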
The source code linked to at the bottom of http://www.jamesmolloy.co.uk/tutorial_html/6.-Paging.html definitely builds something that does work correctly with both QEMU and Bochs, so hopefully you can compare what you're doing with what that's doing, and work out what is wrong.
QEMU is good in general, but I would recommend Bochs for developing really low-level stuff - it includes (or can be configured to include) a very handy internal debugger. e.g. set reset_on_triple_fault=0 on the cpu: line of the config file, set a breakpoint in the switch_page_directory() code, run to the breakpoint, then single step instructions and see what happens...
You can link qemu to a gdb debugger session via the remote debugger tools in gdb. This can be done by issuing the following commands:
qemu -s [optional arguments]
Then inside your gdb session, open your kernel executable, and after setting a break-point in your switch_page_directory() function, type in the following command at the gdb prompt:
target remote localhost:1234
You can then single-step through your kernel at the breakpoint, and see where your triple-fault is taking place.
Another step to consider is to actually install some default exception handlers in your IDT ... the reason you are triple-faulting is because an exception is being thrown by the CPU, but there is no proper exception handler available to handle it. Thus with some default handlers installed, especially the double-fault handler, you can effectively stop the kernel without going into a triple-fault that automatically resets the PC.
Finally, make sure you have re-programmed the PIC before enabling interrupts in protected mode ... otherwise the hardware interrupts, which the BIOS set-up delivers on vectors 8 to 15 (IRQ 0 to 7) in real mode, will land on the CPU exception vectors in protected mode (vector 8 is the double fault).
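The usual 8259 remap is a fixed sequence of eight port writes. The sketch below only builds that sequence as data so it can be inspected; pic_remap_sequence and struct port_write are illustrative names, and you would feed each entry to whatever outb routine your kernel has.

```c
#include <stddef.h>
#include <stdint.h>

struct port_write { uint16_t port; uint8_t value; };

enum { PIC1_CMD = 0x20, PIC1_DATA = 0x21, PIC2_CMD = 0xA0, PIC2_DATA = 0xA1 };

/* Fill `out` with the initialisation words that move the master PIC to
 * vector `master_base` and the slave to `slave_base`. Returns the count. */
size_t pic_remap_sequence(struct port_write out[8],
                          uint8_t master_base, uint8_t slave_base)
{
    const struct port_write seq[8] = {
        { PIC1_CMD,  0x11 },        /* ICW1: start init, ICW4 follows */
        { PIC2_CMD,  0x11 },
        { PIC1_DATA, master_base }, /* ICW2: vector offset */
        { PIC2_DATA, slave_base },
        { PIC1_DATA, 0x04 },        /* ICW3: slave attached to IRQ2 */
        { PIC2_DATA, 0x02 },        /* ICW3: slave cascade identity */
        { PIC1_DATA, 0x01 },        /* ICW4: 8086 mode */
        { PIC2_DATA, 0x01 },
    };
    for (size_t i = 0; i < 8; i++)
        out[i] = seq[i];
    return 8;
}
```

Calling it with 0x20 and 0x28 gives the conventional layout where IRQ 0 to 15 arrive on vectors 32 to 47, clear of the 32 vectors reserved for CPU exceptions.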
I was also facing the same problem with the paging tutorial. After some searching I found the solution: as soon as paging is enabled, all addresses become virtual, so we must map the virtual addresses we are already using to the same physical addresses so that they still refer to the same thing. This is called identity mapping.
You can follow this link for further help in implementing identity mapping.
One more thing: you have to memset the newly allocated space to zero, because it may contain garbage values, and the tutorial does not do this. It works on Bochs because Bochs sets the space to zero for you, but other emulators (QEMU) and real hardware are not so kind.
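To see why the zeroing matters: the CPU decides whether an entry is mapped by looking at bit 0, so a garbage word in a fresh table has a good chance of looking like a valid mapping to a random frame. A tiny sketch (clear_table and entry_present are made-up names):

```c
#include <stdint.h>
#include <string.h>

enum { TABLE_ENTRIES = 1024 };

/* Zero a freshly allocated page table or directory so that every
 * entry starts out not-present. */
void clear_table(uint32_t table[TABLE_ENTRIES])
{
    memset(table, 0, TABLE_ENTRIES * sizeof table[0]);
}

/* Bit 0 of an entry is the present flag. */
int entry_present(uint32_t entry)
{
    return (entry & 0x1u) != 0;
}
```

Call clear_table on every table and directory right after allocating it, before filling in the entries you actually want.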