Why isn't a.out in machine language? - c

I compiled the following program with gcc and got an executable file, a.out:
#include <stdio.h>
int main () {
printf("hello, world\n");
}
When I execute cat a.out, why is the file in "gibberish" (what is this called?) and not machine language of 0s and 1s:
??????? H__PAGEZERO(__TEXT__text__TEXT?`??__stubs__TEXT
P__unwind_info__TEXT]P]__eh_frame__TEXT?H??__DATA__program_vars [continued]

The file is made of 0s and 1s, but when you open it in a text editor those bits are grouped into bytes and then interpreted as text ;) On Linux you can disassemble the output file to confirm that it contains machine instructions (x86 architecture):
objdump -D -mi386 a.out
Example output:
1: 83 ec 08 sub $0x8,%esp
4: be 01 00 00 00 mov $0x1,%esi
9: bf 00 00 00 00 mov $0x0,%edi
The second column shows those 0s and 1s in hexadecimal notation, and the third column shows the corresponding assembly mnemonics.
If you want to display the 0s and 1s themselves, simply type:
xxd -b a.out
Example output:
0000000: 01111111 01000101 01001100 01000110 00000010 00000001 .ELF..
0000006: 00000001 00000000 00000000 00000000 00000000 00000000 ......

It's in some kind of executable file format. On Linux, it's probably ELF, on Mac OS X it's probably Mach-O, and so on. There's even an a.out format, but it's not that common anymore.
It can't just be bare machine instructions - the operating system needs some information about how to load it, what dynamic libraries to attach to it, etc.

Characters are also made of 0's and 1's, and the computer has no way of knowing the difference. You asked it to show the file and it did.
In addition to the machine instructions, the binary file also contains layout and optional debug information which can be readable strings.

The a.out file is in a format that the loader of the OS you are using can understand. The pieces of text you see are names that mark the different parts of the 0s and 1s you expect.
The ? and ` characters show spots where there is unprintable binary data.

The typical format on Linux systems these days is ELF. The ELF file may contain machine code, which you can examine with the objdump utility.
$ gcc main.c
$ objdump -d -j .text a.out
a.out: file format elf64-x86-64
Disassembly of section .text:
(code omitted for brevity)
00000000004005ac :
4005ac: 55 push %rbp
4005ad: 48 89 e5 mov %rsp,%rbp
4005b0: bf 6c 06 40 00 mov $0x40066c,%edi
4005b5: e8 d6 fe ff ff callq 400490
4005ba: 5d pop %rbp
4005bb: c3 retq
4005bc: 0f 1f 40 00 nopl 0x0(%rax)
See? Machine code. The objdump utility helpfully prints it in hexadecimal, with the corresponding disassembled code on the right and the addresses on the left.

Related

Which Clang/GCC linker flag should be used to produce offsets in code that stay within the binary range?

I'm trying to link my code with an external static library, that has this piece of code in the binary:
0000000000000000 <some_method>:
0: 48 8d 05 00 00 00 00 lea 0x0(%rip),%rax # 7 <some_method+0x7>
7: c3 retq
After linking with my code, the linker writes an actual offset instead of the zeroes:
00000000000175c0 <some_method>:
175c0: 48 8d 05 39 aa 20 00 lea 0x20aa39(%rip),%rax # 222000 <some_method.method>
175c7: c3 retq
Offset 222000 is supposed to be in the .data section according to the readelf output, which is supposed to be OK, but the problem is that I need to copy my binary code "as is" into some memory space and make it run from there, without using any OS loaders that know how to relocate different sections of the binary in the process address space. The memory address to which I load my binary can change too, so I can't use static non-relative offsets in my code either.
I want all my RIP-relative offsets in the code to be only within the binary file size range, so for example if my binary is 0x10000 bytes size, and I load it at address 0x200000, I don't want any offsets to go beyond the address 0x210000. Is there a way to tell the linker to do that somehow?

What would happen if a system executes a part of the file that is zero-padded?

I've seen in some posts/videos that files are sometimes zero-padded to look bigger than they are, or to match the "same file size" criteria that some file system utilities use when moving files; mostly these are either prank programs or malware.
But I have often wondered what would happen if the file were corrupted and execution "loaded" the next set of "instructions" from the big zero-padded space at the end of the file.
Would anything happen? What's the instruction set for 0x0?
The decoding of 0 bytes depends completely on the CPU architecture. On many architectures, instructions are fixed length (for example 32-bit), so the relevant thing would be 00 00 00 00 (using hexdump notation).
On most Linux distros, clang/llvm comes with support for multiple target architectures built-in (clang -target and llvm-objdump), unlike gcc / gas / binutils, so I was able to use that to check for some architectures I didn't have cross-gcc / binutils installed for. Use llvm-objdump --version to see the supported list. (But I didn't figure out how to get it to disassemble a raw binary like binutils objdump -b binary, and my clang won't create SPARC binaries on its own.)
On x86, 00 00 (2 bytes) decodes (http://ref.x86asm.net/coder32.html) as an 8-bit add with a memory destination. The first byte is the opcode, the 2nd byte is the ModR/M that specifies the operands.
This usually segfaults right away (if eax/rax isn't a valid pointer), or segfaults once execution falls off the end of the zero-padded part into an unmapped page. (This happens in real life because of bugs like falling off the end of _start without making an exit system call, although in those cases the following bytes aren't always all zero; they may be data or ELF metadata, for example.)
x86 64-bit mode: ndisasm -b64 /dev/zero | head:
address machine code disassembly
00000000 0000 add [rax],al
x86 32-bit mode (-b32):
00000000 0000 add [eax],al
x86 16-bit mode: (-b16):
00000000 0000 add [bx+si],al
AArch32 ARM mode: cd /tmp && dd if=/dev/zero of=zero bs=16 count=1 && arm-none-eabi-objdump -z -D -b binary -marm zero. (Without -z, objdump skips over large blocks of all-zero and shows ...)
addr machine code disassembly
0: 00000000 andeq r0, r0, r0
ARM Thumb/Thumb2: arm-none-eabi-objdump -z -D -b binary -marm --disassembler-options=force-thumb zero
0: 0000 movs r0, r0
2: 0000 movs r0, r0
AArch64: aarch64-linux-gnu-objdump -z -D -b binary -maarch64 zero
0: 00000000 .inst 0x00000000 ; undefined
MIPS32: echo .long 0 > zero.S && clang -c -target mips zero.S && llvm-objdump -d zero.o
zero.o: file format ELF32-mips
Disassembly of section .text:
0: 00 00 00 00 nop
PowerPC 32 and 64-bit: -target powerpc and -target powerpc64. IDK if any extensions to PowerPC use the 00 00 00 00 instruction encoding for anything, or if it's still an illegal instruction on modern IBM POWER chips.
zero.o: file format ELF32-ppc (or ELF64-ppc64)
Disassembly of section .text:
0: 00 00 00 00 <unknown>
IBM S390: clang -c -target systemz zero.S
zero.o: file format ELF64-s390
Disassembly of section .text:
0: 00 00 <unknown>
2: 00 00 <unknown>

Why static string in .rodata section has a four dots prefix in GCC?

For the following code:
#include <stdio.h>
int main() {
printf("Hello World");
printf("Hello World1");
return 0;
}
the generated assembly for calling printf is as follows (64 bits):
400474: be 24 06 40 00 mov esi,0x400624
400479: bf 01 00 00 00 mov edi,0x1
40047e: 31 c0 xor eax,eax
400480: e8 db ff ff ff call 400460 <__printf_chk@plt>
400485: be 30 06 40 00 mov esi,0x400630
40048a: bf 01 00 00 00 mov edi,0x1
40048f: 31 c0 xor eax,eax
400491: e8 ca ff ff ff call 400460 <__printf_chk@plt>
And the .rodata section is as follows:
Contents of section .rodata:
400620 01000200 48656c6c 6f20576f 726c6400 ....Hello World.
400630 48656c6c 6f20576f 726c6431 00 Hello World1.
Based on the assembly code, the first call to printf takes an argument at address 400624, which is offset 4 bytes from the start of .rodata. I know it skips the first 4 bytes because of the four-dot prefix. But my question is: why does GCC or the linker produce this prefix for a string in .rodata? I am using GCC 4.8.4 on Ubuntu 14.04. The compilation command is just: gcc -Ofast my-source.c -o my-program.
For starters, those are not four dots; a dot just stands for an unprintable character. You can see in the hex dump that those bytes are 01 00 02 00.
The final program contains other object files added by the linker, which are part of the C runtime library. This data is used by code there.
You can see the address is 0x400620. You can then try to find a matching symbol; for example, you can load the program into gdb and use the info symbol command:
(gdb) info symbol 0x4005f8
_IO_stdin_used in section .rodata of /tmp/a.out
(Note I had a different address.)
Taking it further, you can actually find the source for this in glibc:
/* This records which stdio is linked against in the application. */
const int _IO_stdin_used = _G_IO_IO_FILE_VERSION;
and
#define _G_IO_IO_FILE_VERSION 0x20001
Which corresponds to the value you see if you account for little-endian storage.
It does not prefix the data. The .rodata section can contain anything. The first four bytes are [seemingly] a string, but that data just happens to be linked there (i.e. it's for something else). It is unrelated to your "Hello World".

Can objdump use bss variable names in text section?

I am using objdump to generate the disassembly of C code and wondering if there is a way to get the names of global variables (which live in the .bss section, not on the heap) to be used in the .text section disassembly, rather than the hex addresses.
For example,
int main(void)
{
while (1)
{
request_ID = Receive(0, buffer, MSG_SIZE, 0);
}
return 0;
}
This is compiled, and then using objdump -D file.o I get the disassembly, including the .bss section:
400: 55 push ebp
401: 89 e5 mov ebp,esp
403: 83 ec 18 sub esp,0x18
...
43d: a1 cc 24 00 00 mov eax,ds:0x24cc
...
Disassembly of section .bss:
000024cc <request_pID>:
What I would like is for the hex addresses of variables to be replaced by their name:
43d: a1 cc 24 00 00 mov eax, <request_pID>
I could write a sed script or something similar to achieve this, but was wondering if there was a simpler option.
Even better would be for both the address and the variable name to be printed to aid in debugging.
43d: a1 cc 24 00 00 mov eax, ds:0x24cc <request_pID>
The code is for an operating system development being tested in Bochs, so if there is some other way of loading symbols into Bochs' debugger that would be a good workaround, although I would still like the objdump output to be created as well.
thanks, Paul

How can a linker determine the address of certain data in the .rodata section?

The test platform is 32-bit Linux.
I use gcc to generate an object file for quickSort like this:
gcc -c quickSort.c
and the generated quickSort.o is a relocatable ELF file:
#file quickSort.o
quickSort.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
I then use objdump to disassemble it:
objdump -d quickSort.o
Looking at the generated disassembly, I am confused by this:
51: b8 00 00 00 00 mov $0x0,%eax
56: 89 04 24 mov %eax,(%esp)
59: e8 fc ff ff ff call 5a <main+0x5a>
5e: c7 44 24 3c 00 00 00 movl $0x0,0x3c(%esp)
The above code calls printf to print a string. If I compile quickSort.c into quickSort.s instead, it looks like this:
movl $.LC0, %eax
movl %eax, (%esp)
call printf
By looking at the relocation table, I can easily find the relationship between "5a" and printf, and I am sure the linker can use it to relocate printf and replace "fc ff ff ff" with the real address of printf.
But I am confused about how the address of .LC0 (a string in the .rodata section) gets relocated. I cannot find any clue in the relocation table (which I got using readelf -r quickSort.o).
Could anyone help me understand how the linker finds the real memory address of data in the .rodata section?
It's done in the same way. You should be seeing a relocation entry against .rodata, where .rodata means the start address of the part of .rodata that's in the current object file.
Note that objdump -dr might be a better tool for the job.
