What would happen if a system executes the part of a file that is zero-padded?

I've seen in some posts/videos that files are sometimes zero-padded to look bigger than they are, or to match the "same file size" criteria some file-system utilities have for moving files; mostly they are either prank programs or malware.
But I've often wondered: what would happen if the file got corrupted and execution "loaded" the next set of "instructions" from the big zero-padded space at the end of the file?
Would anything happen? What instruction does 0x00 decode as?

The decoding of zero bytes depends entirely on the CPU architecture. On many architectures, instructions are fixed-length (for example 32-bit), so the relevant thing would be 00 00 00 00 (using hexdump notation).
On most Linux distros, clang/llvm comes with support for multiple target architectures built in (clang -target and llvm-objdump), unlike gcc / gas / binutils, so I was able to use that to check some architectures I didn't have a cross gcc / binutils installed for. Use llvm-objdump --version to see the supported list. (But I didn't figure out how to get it to disassemble a raw binary like binutils objdump -b binary, and my clang won't create SPARC binaries on its own.)
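(A possible workaround, not something used in the rest of this answer: LLVM also ships llvm-mc, which can disassemble raw hex bytes for any target your LLVM build includes, without needing an object file. A sketch, assuming llvm-mc is installed with the SPARC target:
echo "0x00 0x00 0x00 0x00" | llvm-mc --disassemble --triple=sparc
Swap --triple= for whatever architecture you want to check; the output format differs a bit from binutils objdump.)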
On x86, 00 00 (2 bytes) decodes (http://ref.x86asm.net/coder32.html) as an 8-bit add with a memory destination. The first byte is the opcode, the 2nd byte is the ModR/M that specifies the operands.
This usually segfaults right away (if eax/rax isn't a valid pointer), or segfaults once execution falls off the end of the zero-padded part into an unmapped page. (This happens in real life because of bugs like falling off the end of _start without making an exit system call, although in those cases the following bytes aren't always all zero, e.g. data or ELF metadata.)
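If you want to see this in action, here's a minimal sketch (my own illustration, not part of the scenario in the question) that maps a page of zero bytes as executable on x86-64 Linux and jumps into it:
/* zeropage.c - sketch: execute a page of zero bytes on x86-64 Linux */
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    /* MAP_ANONYMOUS memory is zero-filled, so this page is all 00 bytes. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_EXEC,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }
    printf("jumping into a page of zero bytes at %p\n", page);
    /* Each pair of zero bytes decodes as add [rax],al.  Expect SIGSEGV:
     * either immediately (if rax doesn't point at writable memory) or once
     * execution falls off the end of this page into an unmapped one. */
    ((void (*)(void))page)();
    puts("returned (this would be unexpected)");
    return 0;
}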
x86 64-bit mode: ndisasm -b64 /dev/zero | head:
address machine code disassembly
00000000 0000 add [rax],al
x86 32-bit mode (-b32):
00000000 0000 add [eax],al
x86 16-bit mode: (-b16):
00000000 0000 add [bx+si],al
AArch32 ARM mode: cd /tmp && dd if=/dev/zero of=zero bs=16 count=1 && arm-none-eabi-objdump -z -D -b binary -marm zero (without -z, objdump skips over large blocks of all-zero bytes and shows ...)
addr machine code disassembly
0: 00000000 andeq r0, r0, r0
ARM Thumb/Thumb2: arm-none-eabi-objdump -z -D -b binary -marm --disassembler-options=force-thumb zero
0: 0000 movs r0, r0
2: 0000 movs r0, r0
AArch64: aarch64-linux-gnu-objdump -z -D -b binary -maarch64 zero
0: 00000000 .inst 0x00000000 ; undefined
MIPS32: echo .long 0 > zero.S && clang -c -target mips zero.S && llvm-objdump -d zero.o
zero.o: file format ELF32-mips
Disassembly of section .text:
0: 00 00 00 00 nop
PowerPC 32 and 64-bit: -target powerpc and -target powerpc64. IDK if any extensions to PowerPC use the 00 00 00 00 instruction encoding for anything, or if it's still an illegal instruction on modern IBM POWER chips.
zero.o: file format ELF32-ppc (or ELF64-ppc64)
Disassembly of section .text:
0: 00 00 00 00 <unknown>
IBM S390: clang -c -target systemz zero.S
zero.o: file format ELF64-s390
Disassembly of section .text:
0: 00 00 <unknown>
2: 00 00 <unknown>

Related

Assembly code different from gdb display of code

I'm learning about operating systems from the book Operating Systems: From 0 to 1, and I'm trying to display the code of my kernel's main, but the code displayed in GDB is not the same, even though I jumped to the address that is the entry point.
bootloader.asm
;*************************************************
; bootloader.asm
; A Simple Bootloader
;*************************************************
bits 16
start: jmp boot
;; constants and variable definitions
msg db "Welcome to My Operating System!", 0ah, 0dh, 0h
boot:
cli ; no interrupts
cld ; all that we need to init
mov ax, 0x0000
;; set buffer
mov es, ax
mov bx, 0x0600
mov al, 1 ; read one sector
mov ch, 0 ; track 0
mov cl, 2 ; sector to read
mov dh, 0 ; head number
mov dl, 0 ; drive number
mov ah, 0x02 ; read sectors from disk
int 0x13 ; call the BIOS routine
jmp 0x0000:0x0600 ; jump and execute the sector!
hlt ; halt the system
; We have to be 512 bytes. Clear the rest of the bytes with 0
times 510 - ($-$$) db 0
dw 0xAA55 ; Boot Signature
readelf -h main
ELF Header:
Magic: 7f 45 4c 46 01 01 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Intel 80386
Version: 0x1
Entry point address: 0x600
Start of program headers: 52 (bytes into file)
Start of section headers: 12888 (bytes into file)
Flags: 0x0
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 3
Size of section headers: 40 (bytes)
Number of section headers: 12
Section header string table index: 11
readelf -l main
Elf file type is EXEC (Executable file)
Entry point 0x600
There are 3 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000000 0x00000000 0x00000000 0x00094 0x00094 R 0x4
LOAD 0x000000 0x00000000 0x00000000 0x00094 0x00094 R 0x4
LOAD 0x000100 0x00000600 0x00000600 0x00006 0x00006 R E 0x100
Section to Segment mapping:
Segment Sections...
00
01
02 .text
main.c
void main(){}
objdump -z -M intel -S -D build/os/main
Disassembly of section .text:
00000600 <main>:
void main(){}
600: 55 push ebp
601: 89 e5 mov ebp,esp
603: 90 nop
604: 5d pop ebp
605: c3 ret
But this is GDB's output after setting a breakpoint at main (0x600):
0x600 <main>    jg     0x647
0x602 <main+2>  dec    esp
0x603 <main+3>  inc    esi
0x604 <main+4>  add    DWORD PTR [ecx],eax
Why is this happening? Am I loading at the wrong address? How do I find the correct address to load at?
Edit: here is the code for compiling:
nasm -f elf bootloader.asm -F dwarf -g -o ../build/bootloader/bootloader.o
ld -m elf_i386 -T bootloader.lds ../build/bootloader/bootloader.o -o ../build/bootloader/bootloader.o.elf
objcopy -O binary ../build/bootloader/bootloader.o.elf ../build/bootloader/bootloader.o
gcc -ffreestanding -nostdlib -fno-pic -gdwarf-4 -m16 -ggdb3 -c main.c -o ../build/os/main.o
ld -m elf_i386 -nmagic -T os.lds ../build/os/main.o -o ../build/os/main
dd if=/dev/zero of=disk.img bs=512 count=2880
2880+0 records in
2880+0 records out
1474560 bytes (1.5 MB, 1.4 MiB) copied, 0.0150958 s, 97.7 MB/s
dd conv=notrunc if=build/bootloader/bootloader.o of=disk.img bs=512 count=1 seek=0
1+0 records in
1+0 records out
512 bytes copied, 0.000127745 s, 4.0 MB/s
dd conv=notrunc if=build/os/main.o of=disk.img bs=512 count=$((8504/512)) seek=1
16+0 records in
16+0 records out
8192 bytes (8.2 kB, 8.0 KiB) copied, 0.000184251 s, 44.5 MB/s
qemu-system-i386 -machine q35 -fda disk.img -gdb tcp::26000 -S
and the GDB commands for displaying main's code:
set architecture i8086
target remote localhost:26000
b *0x7c00
set disassembly-flavor intel
layout asm
layout reg
symbol-file build/os/main
b main
jg / dec esp / inc esi is the ELF magic number, not machine code! You'll see the same thing from the start of the output of ndisasm -b32 /bin/ls. (ndisasm always treats its input as a flat binary; it doesn't look for any metadata.)
7F 45 4C 46 is the string "ELF" after a 0x7F byte, the ELF magic number that identifies the file format as ELF. It's followed by more ELF header bytes before the actual machine code for main. objdump -D disassembles all ELF sections, but it still parses the ELF headers, not disassembling them like ndisasm does. So you still just end up seeing the code from the .text section because the others are empty (because you linked this executable without libc or CRT startfiles, and with C main as the ELF entry point?!?)
You're jumping to the start of the ELF file as if it was a flat binary. It's not, writing an ELF program loader is not that simple. The ELF program headers (which readelf can parse) tell you which file offset goes at which address. The start of the .text section will be at some offset into the file, not overlapping the ELF magic number for obvious reasons. (Although it can overlap with the ELF header if you can find a way to make it fit: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html)
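For example, using the program headers quoted above: the entry point 0x600 falls inside the second LOAD segment (Offset 0x000100, VirtAddr 0x00000600), so its file offset is 0x100 + (0x600 - 0x600) = 0x100. In other words, the machine code for main starts 256 bytes into the ELF file, not at byte 0 where you jumped.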
Then once you have the file mapped into memory as specified in the program headers, you jump to the ELF entry point address (0x600 in your case). (Which is normally not a function; under a real OS like Linux, you can't ret from the entry point. Instead you need to make an exit system call.) You can't here, either, because you jmp to it instead of call.
This is why _start is separate from main; building a program with a compiler-generated main as its entry point doesn't work.
Of course most of this effort is doomed because you're jumping to your main with the CPU still in 16-bit real mode. But your main is compiled/assembled for 32-bit mode. You could somewhat work around that with gcc -m16 to assemble gcc output for 16-bit mode, using operand-size + address-size prefixes as necessary.
The machine code for that do-nothing main will actually work the same in both 16 and 32-bit mode. If you'd used return 0 without optimization, that wouldn't be the case: the opcode (without prefixes) for mov eax, imm32 implies a different instruction length depending on what mode the CPU decodes it in, so decoding it in 16-bit mode would only write AX and leave the last 2 bytes of the immediate to be decoded as a separate instruction.
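To illustrate that point (my own example, not from your build): the 5 bytes b8 01 00 00 00 are one instruction or three depending on the mode decoding them:
32-bit mode:  b8 01 00 00 00   mov eax,0x1
16-bit mode:  b8 01 00         mov ax,0x1
              00 00            add [bx+si],al   (the leftover immediate bytes)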
Most likely the easiest thing to do is turn your "kernel" into a flat binary, instead of writing an ELF program loader in your bootloader. Follow an osdev tutorial because lots can go wrong, and you have to be careful about static data for example.
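A rough sketch of that approach (untested; the paths and the 0x600 load address are assumed from your build commands above, and this drops your os.lds in favour of -Ttext):
ld -m elf_i386 -Ttext 0x600 --oformat binary -o ../build/os/main.bin ../build/os/main.o
dd conv=notrunc if=../build/os/main.bin of=disk.img bs=512 seek=1
That way the bytes your bootloader copies to 0x0000:0x0600 start directly with main's machine code instead of an ELF header. (If you still keep an ELF around for GDB's symbol-file, just make sure only the code, not the ELF headers, ends up on the disk image.)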
Or see How to make the kernel for my bootloader? for an example bootloader that calls a C function after switching to 32-bit protected mode.
See more links in https://stackoverflow.com/tags/x86/info.

Generate raw binary from C code in Linux

I have been implementing, just for fun, a simple operating system for the x86 architecture from scratch. I implemented the assembly code for the bootloader that loads the kernel from disk and enters 32-bit mode. The kernel code that is loaded is written in C, so in order to execute it the idea is to generate a raw binary from the C code.
Firstly, I used these commands:
$gcc -ffreestanding -c kernel.c -o kernel.o -m32
$ld -o kernel.bin -Ttext 0x1000 kernel.o --oformat binary -m elf_i386
However, it didn't generate any binary, and gave back this error:
kernel.o: In function 'main':
kernel.c:(.text+0xc): undefined reference to '_GLOBAL_OFFSET_TABLE_'
Just for clarity's sake, the kernel.c code is:
/* kernel.c */
void main ()
{
char *video_memory = (char *) 0xb8000 ;
*video_memory = 'X';
}
Then I followed this tutorial: http://wiki.osdev.org/GCC_Cross-Compiler
to build my own cross-compiler for my own target. It worked for my purpose; however, disassembling with ndisasm I obtained this code:
00000000 55 push ebp
00000001 89E5 mov ebp,esp
00000003 83EC10 sub esp,byte +0x10
00000006 C745FC00800B00 mov dword [ebp-0x4],0xb8000
0000000D 8B45FC mov eax,[ebp-0x4]
00000010 C60058 mov byte [eax],0x58
00000013 90 nop
00000014 C9 leave
00000015 C3 ret
00000016 0000 add [eax],al
00000018 1400 adc al,0x0
0000001A 0000 add [eax],al
0000001C 0000 add [eax],al
0000001E 0000 add [eax],al
00000020 017A52 add [edx+0x52],edi
00000023 0001 add [ecx],al
00000025 7C08 jl 0x2f
00000027 011B add [ebx],ebx
00000029 0C04 or al,0x4
0000002B 0488 add al,0x88
0000002D 0100 add [eax],eax
0000002F 001C00 add [eax+eax],bl
00000032 0000 add [eax],al
00000034 1C00 sbb al,0x0
00000036 0000 add [eax],al
00000038 C8FFFFFF enter 0xffff,0xff
0000003C 16 push ss
0000003D 0000 add [eax],al
0000003F 0000 add [eax],al
00000041 41 inc ecx
00000042 0E push cs
00000043 088502420D05 or [ebp+0x50d4202],al
00000049 52 push edx
0000004A C50C04 lds ecx,[esp+eax]
0000004D 0400 add al,0x0
0000004F 00 db 0x00
As you can see, the first 9 rows (except for the nop, which I don't know why it is inserted) are the assembly translation of my main function. From row 10 to the end, there's a lot of code and I don't know why it is there.
In the end, I have two questions:
1) Why is that code produced?
2) Is there a way to produce the raw machine code from C without that useless stuff?
A few hints first:
avoid naming your starting routine main. It is confusing (both for the reader and perhaps for the compiler; when you don't pass -ffreestanding to gcc, it handles main very specifically). Use something else like start or begin_of_my_kernel ...
compile with gcc -v to understand what your particular compiler is doing.
you probably should ask your compiler for some optimizations and all warnings, so pass -O -Wall at least to gcc
you may want to look into the produced assembler code, so use gcc -S -O -Wall -fverbose-asm kernel.c to get the kernel.s assembler file and glance into it
as commented by Michael Petch, you might want to pass -fno-exceptions (the data after the ret in your dump is most likely gcc's .eh_frame unwind tables being disassembled as if they were code; note the 01 7A 52 "zR" marker around offset 0x20. Passing -fno-asynchronous-unwind-tables suppresses that section.)
you probably need a linker script and/or some hand-written assembler for crt0
you should read something about linkers & loaders
kernel.c:(.text+0xc): undefined reference to '_GLOBAL_OFFSET_TABLE_'
This smells like something related to position-independent code. My guess: try compiling with an explicit -fno-pic or -fno-pie
(on some Linux distributions, their gcc might be configured with some -fpic enabled by default)
PS. Don't forget to add -m32 to gcc if you want x86 32 bits binaries.
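Putting those hints together, something like the following should produce a clean flat binary; this is a sketch mirroring your original two commands with the suggested flags added, so adjust for your toolchain:
gcc -m32 -ffreestanding -fno-pic -fno-asynchronous-unwind-tables -O -Wall -c kernel.c -o kernel.o
ld -m elf_i386 -Ttext 0x1000 --oformat binary -o kernel.bin kernel.o
ndisasm -b32 kernel.bin    # should now show only the code for your entry routine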

Creating x86 bootloader

I am writing a bootloader as follows:
bits 16
[org 0x7c00]
KERN_OFFSET equ 0x1000
mov [BOOTDISK], dl
mov dl, 0x0 ;0 is for floppy-disk
mov ah, 0x2 ;Read function for the interrupt
mov al, 0x15 ;Read 15 sectors containing kernel
mov ch, 0x0 ;Use cylinder 0
mov cl, 0x2 ;Start from the second sector which contains kernel
mov dh, 0x0 ;Read head 0
mov bx, KERN_OFFSET
int 0x13
jc disk_error
cmp al, 0x15
jne disk_error
jmp KERN_OFFSET:0x0
jmp $
disk_error:
jmp $
BOOTDISK: db 0
times 510-($-$$) db 0
dw 0xaa55
The kernel is a simple C program which prints "e" on the VGA display (seen on QEmu):
void main()
{
extern void put_in_mem();
char c = 'e';
put_in_mem(c, 0xA0);
}
I am using this code in 16 bit (real mode) in QEmu so I am using the compiler bcc for this code using:
bcc -ansi -c -o kernel.o kernel.c
I have the following questions:
1. When I try to disassemble this code, using
objdump -D -b binary -mi386 kernel.o
I get an output like this (only initial portion of output):
kernel.o: file format binary
Disassembly of section .data:
00000000 <.data>:
0: a3 86 01 00 2a mov %eax,0x2a000186
5: 3e 00 00 add %al,%ds:(%eax)
8: 00 22 add %ah,(%edx)
a: 00 00 add %al,(%eax)
c: 00 19 add %bl,(%ecx)
e: 00 00 add %al,(%eax)
10: 00 55 55 add %dl,0x55(%ebp)
13: 55 push %ebp
14: 55 push %ebp
15: 00 00 add %al,(%eax)
17: 00 02 add %al,(%edx)
19: 22 00 and (%eax),%al
This output does not seem to correspond to the kernel.c file I made. For example, I could not see where 'e' is stored as ASCII 0x65, or where the call to put_in_mem is made. Is something wrong with the way I am disassembling the code?
To make the object file of the kernel for QEmu I used the following command:
ld86 -o kernel -d kernel.o put_in_mem.o
Here put_in_mem.o is the object file created after assembling the put_in_mem.asm file which contains the definition of the function put_in_mem() used in kernel.c.
Then floppy image for QEmu is made using:
cat boot.o kernel > floppy_img
But when I try to look at the address 0x10000 (using GDB), where the kernel was supposed to be present after loading (using the boot.asm program), it was not present.
Why is this happening?
Further, with the ld command we used the -Ttext option to specify the load address of the binary; should we use a similar option here with ld86?
Your kernel.o is in an object file format not understood by objdump, so it tries to disassemble everything in it, including headers and whatnot. Try to disassemble the linked output kernel instead. Also, objdump might not understand 16-bit code. Better try objdump86 if you have that available.
As to why it's not present: you are looking at the wrong place. You are loading it to offset 0x1000 (3 zeroes) but you are looking at 0x10000 (4 zeroes). Also note that you don't set up ES which is bad practice. Maybe you intended to set ES to 0x1000 and BX to 0x0000 and then you would find your kernel at 0x10000 physical address.
The -Ttext doesn't influence loading, it only specifies where the code expects to find itself.
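Going back to the load address: a sketch of the ES:BX fix suggested above (untested, NASM syntax, keeping KERN_OFFSET equ 0x1000 but now using it as a segment):
mov ax, KERN_OFFSET
mov es, ax            ; ES = 0x1000
xor bx, bx            ; ES:BX = 1000h:0000h -> physical 0x10000
...                   ; set up AH/AL/CH/CL/DH/DL for int 0x13 as before
int 0x13
...
jmp KERN_OFFSET:0x0   ; CS:IP = 1000h:0000h, matching where it was loaded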

powerpc disassemble instruction by its raw data

How can I disassemble some instructions from a memory dump? I have only a raw dump; objdump understands only object formats.
My processor is PowerPC 440 (PowerPC Book E Architecture).
In fact, objdump can disassemble raw binary just fine. Try this:
objdump -m ppc -D -b binary -EB dump.bin
$ xxd -p asm
3800000060000000
Using devkitPPC:
$ powerpc-eabi-objdump --disassemble-zeroes -m powerpc -D -b binary -EB asm
Yields:
asm: file format binary
Disassembly of section .data:
00000000 <.data>:
0: 38 00 00 00 li r0,0
4: 60 00 00 00 nop
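If your dump is only available as a hex string (like the asm file above), you can turn it back into raw bytes and disassemble it in one go; a sketch, with the file names assumed:
echo 3800000060000000 | xxd -r -p > dump.bin
objdump -m ppc -D -b binary -EB dump.bin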

Why isn't a.out in machine language?

I compile the following program with gcc and receive an output executable file a.out:
#include <stdio.h>
int main () {
printf("hello, world\n");
}
When I execute cat a.out, why is the file in "gibberish" (what is this called?) and not machine language of 0s and 1s:
??????? H__PAGEZERO(__TEXT__text__TEXT?`??__stubs__TEXT
P__unwind_info__TEXT]P]__eh_frame__TEXT?H??__DATA__program_vars [continued]
The file is in 0s and 1s, but when you open it with a text editor those bits are grouped into bytes and then treated as text ;) On Linux you could try to disassemble the output file to confirm that it contains machine instructions (x86 architecture):
objdump -D -mi386 a.out
Example output:
1: 83 ec 08 sub $0x8,%esp
4: be 01 00 00 00 mov $0x1,%esi
9: bf 00 00 00 00 mov $0x0,%edi
The second column contains those 0s and 1s in hexadecimal notation, and the third column contains the mnemonic assembly instructions.
If you want to display those 0s and 1s directly, simply type:
xxd -b a.out
Example output:
0000000: 01111111 01000101 01001100 01000110 00000010 00000001 .ELF..
0000006: 00000001 00000000 00000000 00000000 00000000 00000000 ......
It's in some kind of executable file format. On Linux, it's probably ELF, on Mac OS X it's probably Mach-O, and so on. There's even an a.out format, but it's not that common anymore.
It can't just be bare machine instructions - the operating system needs some information about how to load it, what dynamic libraries to attach to it, etc.
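You can look at that loader metadata directly; for example, on Linux (a sketch, assuming an ELF a.out as described below):
readelf -h a.out    # the ELF header: machine type, entry point, ...
readelf -d a.out    # dynamic section: which shared libraries to load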
Characters are also made of 0's and 1's, and the computer has no way of knowing the difference. You asked it to show the file and it did.
In addition to the machine instructions, the binary file also contains layout and optional debug information which can be readable strings.
The a.out is in a format the loader of the OS you are using can understand. The different pieces of text you see are markers for the different parts of the 0s and 1s you expect.
The ? and ` show spots where there is unprintable binary data.
The typical format on Linux systems these days is ELF. The ELF file may contain machine code, which you can examine with the objdump utility.
$ gcc main.c
$ objdump -d -j .text a.out
a.out: file format elf64-x86-64
Disassembly of section .text:
(code omitted for brevity)
00000000004005ac <main>:
4005ac: 55 push %rbp
4005ad: 48 89 e5 mov %rsp,%rbp
4005b0: bf 6c 06 40 00 mov $0x40066c,%edi
4005b5: e8 d6 fe ff ff callq 400490
4005ba: 5d pop %rbp
4005bb: c3 retq
4005bc: 0f 1f 40 00 nopl 0x0(%rax)
See? Machine code. The objdump utility helpfully prints it in hexadecimal with the corresponding disassembled code on the right, and the addresses on the left.

Resources