Where in object file does the code of function "main" starts?

Where in object file does the code of function "main" starts? - c

I have an object file of a C program which prints hello world, just for the question.
I am trying to understand using readelf utility or gdb or hexedit(I can't figure which tool is a correct one) where in the file does the code of function "main" starts.
I know using readelf that symbol _start & main occurs and the address where it is mapped in a virtual memory. Moreover, I also know what the size of .text section and the of coruse where entry point specified, i.e the address which the same of text section.
The question is - Where in the file does the code of function "main" starts? I tought that is the entry point and the offset of the text section but how I understand it the sections data, bss, rodata should be ran before main and it appears after section text in readelf.
Also I tought we should sum the size all the lines till main in symbol table, but I am not sure at all if it is correct.
Additional question which follow up this one is if I want to replace main function with NOP instrcutres or plant one ret instruction in my object file. how can I know the offset where I can do it using hexedit.

So, let's go through it step by step.
Start with this C file:
#include <stdio.h>
void printit()
{
puts("Hello world!");
}
int main(void)
{
printit();
return 0;
}
As the comments look like you are on x86, compile it as 32-bit non-PIE executable like this:
$ gcc -m32 -no-pie -o test test.c
The -m32 option is needed, because I am working at a x86-64 machine. As you already know, you can get the virtual memory address of main using readelf, objdump or nm, for example like this:
$ nm test | grep -w main
0804918d T main
Obviously, 804918d can not be an offset in the file that is just 15 kB big. You need to find the mapping between virtual memory addresses and file offsets. In a typical ELF file, the mapping is included twice. Once in a detailed form for linkers (as object files are also ELF files) and debuggers, and a second time in a condensed form that is used by the kernel for loading programs. The detailed form is the list of sections, consisting of section headers, and you can view it like this (the output is shortened a bit, to make the answer more readable):
$ readelf --section-headers test
There are 29 section headers, starting at offset 0x3748:
Section Headers:
[Nr] Name Type Addr Off Size ES Flg Lk Inf Al
[...]
[11] .init PROGBITS 08049000 001000 000020 00 AX 0 0 4
[12] .plt PROGBITS 08049020 001020 000030 04 AX 0 0 16
[13] .text PROGBITS 08049050 001050 0001c1 00 AX 0 0 16
[14] .fini PROGBITS 08049214 001214 000014 00 AX 0 0 4
[15] .rodata PROGBITS 0804a000 002000 000015 00 A 0 0 4
[...]
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
L (link order), O (extra OS processing required), G (group), T (TLS),
C (compressed), x (unknown), o (OS specific), E (exclude),
p (processor specific)
Here you find that the .text section starts at (virtual) address 08049050 and has a size of 1c1 bytes, so it ends at address 08049211. The address of main, 804918d is in this range, so you know main is a member of the text section. If you subtract the base of the text section from the address of main, you find that main is 13d bytes into the text section. The section listing also contains the file offset where the data for the text section starts. It's 1050, so the first byte of main is at offset 0x1050 + 0x13d == 0x118d.
You can do the same calculation using program headers:
$ readelf --program-headers test
[...]
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
PHDR 0x000034 0x08048034 0x08048034 0x00160 0x00160 R 0x4
INTERP 0x000194 0x08048194 0x08048194 0x00013 0x00013 R 0x1
[Requesting program interpreter: /lib/ld-linux.so.2]
LOAD 0x000000 0x08048000 0x08048000 0x002e8 0x002e8 R 0x1000
LOAD 0x001000 0x08049000 0x08049000 0x00228 0x00228 R E 0x1000
LOAD 0x002000 0x0804a000 0x0804a000 0x0019c 0x0019c R 0x1000
LOAD 0x002f0c 0x0804bf0c 0x0804bf0c 0x00110 0x00114 RW 0x1000
[...]
The second load line tells you that the area 08049000 (VirtAddr) to 08049228 (VirtAddr + MemSiz) is readable and executable, and loaded from offset 1000 in the file. So again you can calculate that the address of main is 18d bytes into this load area, so it has to reside at offset 0x118d inside the executable. Let's test that:
$ ./test
Hello world!
$ echo -ne '\xc3' | dd of=test conv=notrunc bs=1 count=1 seek=$((0x118d))
1+0 records in
1+0 records out
1 byte copied, 0.0116672 s, 0.1 kB/s
$ ./test
$
Overwriting the first byte of main with 0xc3, the opcode for return (near) on x86, causes the program to not output anything anymore.

_start normally belongs to a module ( a *.o file) that is fixed (it is called differently on different systems, but a common name is crt0.o which is written in assembler.) That fixed code prepares the stack (normally the arguments and the environment are stored in the initial stack segment by the execve(2) system call) the mission of crt0.s is to prepare the initial C stack frame and call main(). Once main() ends, it is responsible of getting the return value from main and calling all the atexit() handlers to finish calling the _exit(2) system call.
The linking of crt0.o is normally transparent due to the fact that you always call the compiler to do the linking itself, so you normally don't have to add crt0.o as the first object module, but the compiler knows (lately, all this stuff has grown considerably, since we depend on architecture and ABIs to pass parameters between functions)
If you execute the compiler with the -v option, you'll get the exact command line it uses to call the linker and you'll get the secrets of the final memory map your program has on its first stages.

Related

cant view C structures from an assembly program? [duplicate]

I recently started learning assembly language for the Intel x86-64 architecture using YASM. While solving one of the tasks suggested in a book (by Ray Seyfarth) I came to following problem:
When I place some characters into a buffer in the .bss section, I still see an empty string while debugging it in gdb. Placing characters into a buffer in the .data section shows up as expected in gdb.
segment .bss
result resb 75
buf resw 100
usage resq 1
segment .data
str_test db 0, 0, 0, 0
segment .text
global main
main:
mov rbx, 'A'
mov [buf], rbx ; LINE - 1 STILL GET EMPTY STRING AFTER THAT INSTRUCTION
mov [str_test], rbx ; LINE - 2 PLACES CHARACTER NICELY.
ret
In gdb I get:
after LINE 1: x/s &buf, result - 0x7ffff7dd2740 <buf>: ""
after LINE 2: x/s &str_test, result - 0x601030: "A"
It looks like &buf isn't evaluating to the correct address, so it still sees all-zeros. 0x7ffff7dd2740 isn't in the BSS of the process being debugged, according to its /proc/PID/maps, so that makes no sense. Why does &buf evaluate to the wrong address, but &str_test evaluates to the right address? Neither are "global" symbols, but we did build with debug info.
Tested with GNU gdb (Ubuntu 7.10-1ubuntu2) 7.10 on x86-64 Ubuntu 15.10.
I'm building with
yasm -felf64 -Worphan-labels -gdwarf2 buf-test.asm
gcc -g buf-test.o -o buf-test
nm on the executable shows the correct symbol addresses:
$ nm -n buf-test # numeric sort, heavily edited to omit symbols from glibc
...
0000000000601028 D __data_start
0000000000601038 d str_test
...
000000000060103c B __bss_start
0000000000601040 b result
000000000060108b b buf
0000000000601153 b usage
(editor's note: I rewrote a lot of the question because the weirdness is in gdb's behaviour, not the OP's asm!).

glibc includes a symbol named buf, as well.
(gdb) info variables ^buf$
All variables matching regular expression "^buf$":
File strerror.c:
static char *buf;
Non-debugging symbols:
0x000000000060108b buf <-- this is our buf
0x00007ffff7dd6400 buf <-- this is glibc's buf
gdb happens to choose the symbol from glibc over the symbol from the executable. This is why ptype buf shows char *.
Using a different name for the buffer avoids the problem, and so does a global buf to make it a global symbol. You also wouldn't have a problem if you wrote a stand-alone program that didn't link libc (i.e. define _start and make an exit system call instead of running a ret)
Note that 0x00007ffff7dd6400 (address of buf on my system; different from yours) is not actually a stack address. It visually looks like a stack address, but it's not: it has a different number of f digits after the 7. Sorry for that confusion in comments and an earlier edit of the question.
Shared libraries are also loaded near the top of the low 47 bits of virtual address space, near where the stack is mapped. They're position-independent, but a library's BSS space has to be in the right place relative to its code. Checking /proc/PID/maps again more carefully, gdb's &buf is in fact in the rwx block of anonymous memory (not mapped to any file) right next to the mapping for libc-2.21.so.
7ffff7a0f000-7ffff7bcf000 r-xp 00000000 09:7f 17031175 /lib/x86_64-linux-gnu/libc-2.21.so
7ffff7bcf000-7ffff7dcf000 ---p 001c0000 09:7f 17031175 /lib/x86_64-linux-gnu/libc-2.21.so
7ffff7dcf000-7ffff7dd3000 r-xp 001c0000 09:7f 17031175 /lib/x86_64-linux-gnu/libc-2.21.so
7ffff7dd3000-7ffff7dd5000 rwxp 001c4000 09:7f 17031175 /lib/x86_64-linux-gnu/libc-2.21.so
7ffff7dd5000-7ffff7dd9000 rwxp 00000000 00:00 0 <--- &buf is in this mapping
...
7ffffffdd000-7ffffffff000 rwxp 00000000 00:00 0 [stack] <---- more FFs before the first non-FF than in &buf.
A normal call instruction with a rel32 encoding can't reach a library function, but it doesn't need to because GNU/Linux shared libraries have to support symbol interposition, so calls to library functions actually jump to the PLT, where an indirect jmp (with a pointer from the GOT) goes to the final destination.

Why are instructions addresses on the top of the memory user space contrary to the linux process memory layout?

#include <stdio.h>
void func() {}
int main() {
printf("%p", &func);
return 0;
}
This program outputted 0x55c4cda9464a
Supposing that func will be stored in the .text section, and according to this figure, from CS:APP:
I suppose that the address of func would be somewhere near the starting address of the .text section, but this address is somewhere in the middle. Why is this the case? Local variables stored on the stack have addresses near 2^48 - 1, but I tried to disassemble different C codes and the instructions were always located somewhere around that 0x55... address.

gcc, when configured with --enable-default-pie1 (which is the default), produces Position Independent Executables(PIE). Which means the load address isn't same as what linker specified at compile-time (0x400000 for x86_64). This is a security mechanism so that Address Space Layout Randomization (ASLR) 2 can be enabled. That is, gcc compiles with -pie option by default.
If you compile with -no-pie option (gcc -no-pie file.c), then you can see the address of func is closer to 0x400000.
On my system, I get:
$ gcc -no-pie t.c
$ ./a.out
0x401132
You can also check the load address with readelf:
$ readelf -Wl a.out | grep LOAD
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x000478 0x000478 R 0x1000
LOAD 0x001000 0x0000000000401000 0x0000000000401000 0x0001f5 0x0001f5 R E 0x1000
LOAD 0x002000 0x0000000000402000 0x0000000000402000 0x000158 0x000158 R 0x1000
LOAD 0x002e10 0x0000000000403e10 0x0000000000403e10 0x000228 0x000230 RW 0x1000
1 you can check this with gcc --verbose.
2 You may also notice that address printed by your program is different in each run. That's because of ASLR. You can disable it with:
$ echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
ASLR is enabled on Linux by default.

How do I add contents of text file as a section in an ELF file?

I have a NASM assembly file that I am assembling and linking (on Intel-64 Linux).
There is a text file, and I want the contents of the text file to appear in the resulting binary (as a string, basically). The binary is an ELF executable.
My plan is to create a new readonly data section in the ELF file (equivalent to the conventional .rodata section).
Ideally, there would be a tool to add a file verbatim as a new section in an elf file, or a linker option to include a file verbatim.
Is this possible?

This is possible and most easily done using OBJCOPY found in BINUTILS. You effectively take the data file as binary input and then output it to an object file format that can be linked to your program.
OBJCOPY will even produce a start and end symbol as well as the size of the data area so that you can reference them in your code. The basic idea is that you will want to tell it your input file is binary (even if it is text); that you will be targeting an x86-64 object file; specify the input file name and the output file name.
Assume we have an input file called myfile.txt with the contents:
the
quick
brown
fox
jumps
over
the
lazy
dog
Something like this would be a starting point:
objcopy --input binary \
--output elf64-x86-64 \
--binary-architecture i386:x86-64 \
myfile.txt myfile.o
If you wanted to generate 32-bit objects you could use:
objcopy --input binary \
--output elf32-i386 \
--binary-architecture i386 \
myfile.txt myfile.o
The output would be an object file called myfile.o . If we were to review the headers of the object file using OBJDUMP and a command like objdump -x myfile.o we would see something like this:
myfile.o: file format elf64-x86-64
myfile.o
architecture: i386:x86-64, flags 0x00000010:
HAS_SYMS
start address 0x0000000000000000
Sections:
Idx Name Size VMA LMA File off Algn
0 .data 0000002c 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, DATA
SYMBOL TABLE:
0000000000000000 l d .data 0000000000000000 .data
0000000000000000 g .data 0000000000000000 _binary_myfile_txt_start
000000000000002c g .data 0000000000000000 _binary_myfile_txt_end
000000000000002c g *ABS* 0000000000000000 _binary_myfile_txt_size
By default it creates a .data section with contents of the file and it creates a number of symbols that can be used to reference the data.
_binary_myfile_txt_start
_binary_myfile_txt_end
_binary_myfile_txt_size
This is effectively the address of the start byte, the end byte, and the size of the data that was placed into the object from the file myfile.txt. OBJCOPY will base the symbols on the input file name. myfile.txt is mangled into myfile_txt and used to create the symbols.
One problem is that a .data section is created which is read/write/data as seen here:
Idx Name Size VMA LMA File off Algn
0 .data 0000002c 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, DATA
You specifically are requesting a .rodata section that would also have the READONLY flag specified. You can use the --rename-section option to change .data to .rodata and specify the needed flags. You could add this to the command line:
--rename-section .data=.rodata,CONTENTS,ALLOC,LOAD,READONLY,DATA
Of course if you want to call the section something other than .rodata with the same flags as a read only section you can change .rodata in the line above to the name you want to use for the section.
The final version of the command that should generate the type of object you want is:
objcopy --input binary \
--output elf64-x86-64 \
--binary-architecture i386:x86-64 \
--rename-section .data=.rodata,CONTENTS,ALLOC,LOAD,READONLY,DATA \
myfile.txt myfile.o
Now that you have an object file, how can you use this in C code (as an example). The symbols generated are a bit unusual and there is a reasonable explanation on the OS Dev Wiki:
A common problem is getting garbage data when trying to use a value defined in a linker script. This is usually because they're dereferencing the symbol. A symbol defined in a linker script (e.g. _ebss = .;) is only a symbol, not a variable. If you access the symbol using extern uint32_t _ebss; and then try to use _ebss the code will try to read a 32-bit integer from the address indicated by _ebss.
The solution to this is to take the address of _ebss either by using it as &_ebss or by defining it as an unsized array (extern char _ebss[];) and casting to an integer. (The array notation prevents accidental reads from _ebss as arrays must be explicitly dereferenced)
Keeping this in mind we could create this C file called main.c:
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
/* These are external references to the symbols created by OBJCOPY */
extern char _binary_myfile_txt_start[];
extern char _binary_myfile_txt_end[];
extern char _binary_myfile_txt_size[];
int main()
{
char *data_start = _binary_myfile_txt_start;
char *data_end = _binary_myfile_txt_end;
size_t data_size = (size_t)_binary_myfile_txt_size;
/* Print out the pointers and size */
printf ("data_start %p\n", data_start);
printf ("data_end %p\n", data_end);
printf ("data_size %zu\n", data_size);
/* Print out each byte until we reach the end */
while (data_start < data_end)
printf ("%c", *data_start++);
return 0;
}
You can compile and link with:
gcc -O3 main.c myfile.o
The output should look something like:
data_start 0x4006a2
data_end 0x4006ce
data_size 44
the
quick
brown
fox
jumps
over
the
lazy
dog
A NASM example of usage is similar in nature to the C code. The following assembly program called nmain.asm writes the same string to standard output using Linux x86-64 System Calls:
bits 64
global _start
extern _binary_myfile_txt_start
extern _binary_myfile_txt_end
extern _binary_myfile_txt_size
section .text
_start:
mov eax, 1 ; SYS_Write system call
mov edi, eax ; Standard output FD = 1
mov rsi, _binary_myfile_txt_start ; Address to start of string
mov rdx, _binary_myfile_txt_size ; Length of string
syscall
xor edi, edi ; Return value = 0
mov eax, 60 ; SYS_Exit system call
syscall
This can be assembled and linked with:
nasm -f elf64 -o nmain.o nmain.asm
gcc -m64 -nostdlib nmain.o myfile.o
The output should appear as:
the
quick
brown
fox
jumps
over
the
lazy
dog

How do I see the memory locations of static variables within .bss?

Supposing I have a static variable declared in gps_anetenova_m10478.c as follows:
static app_timer_id_t m_gps_response_timeout_timer_id;
I have some sort of buffer overrun bug in my code and at some point a write to the variable right before m_gps_response_timeout_timer_id in memory is overwriting it.
I can find out where m_gps_response_timeout_timer_id is in memory using the 'Expressions' view in Eclipse's GDB client. Just enter &m_gps_response_timeout_timer_id. But how do I tell which variable is immediately before it in memory?
Is there a way to get this info into the .map file that ld produces? At the moment I only see source files:
.bss 0x000000002000011c 0x0 _build/debug_leds.o
.bss 0x000000002000011c 0x11f8 _build/gps_antenova_m10478.o
.bss 0x0000000020001314 0x161c _build/gsm_ublox_sara.o

I'll be honest, I don't know enough about Eclipse to give an easy way within Eclipse to get this. The tool you're probably looking for is either objdump or nm. An example with objdump is to simply run objdump -x <myELF>. This will then return all symbols in the file, which section they're in, and their addresses. You'll then have to manually search for the variable in which you're interested based on the addresses.
objdump -x <ELFfile> will give output along the lines of the following:
000120d8 g F .text 0000033c bit_string_copy
00015ea4 g O .bss 00000004 overflow_bit
00015e24 g .bss 00000000 __bss_start
00011ce4 g F .text 0000003c main
00014b6c g F .text 0000008c integer_and
The first column is the address, the fourth the section and the fifth the length of that field.
nm <ELFfile> gives the following:
00015ea8 B __bss_end
00015e24 B __bss_start
0000c000 T _start
00015e20 D zero_constant
00015e24 b zero_constant_itself
The first column is the address and the second the section. D/d is data, B/b is BSS and T/t is text. The rest can be found in the manpage. nm also accepts the -n flag to sort the lines by their numeric address.

Location of destructors in elf files: not where it should be?

I register a token destructor function with
static void cleanup __attribute__ ((destructor));
The function just prints a debug message; the token program runs fine (main() just prints another message; token function prints upon exit).
When I look at the file with
nm ./a.out,
I see: 
08049f10 d __DTOR_END__
08049f0c d __DTOR_LIST__
However, the token destructor function's address should be at 0x08049f10 - an address which contains 0, indicating end of destructor list, as I can check using:
objdump -s ./a.out
At 0x08049f0c, I see 0xffffffff, as is expected for this location. It is my understanding that what I see in the elf file would mean that no destructor was registered; but it is executed with one.
If someone could explain, I'd appreciate. Is this part of the security suite to prevent inserting malicious destructors? How does the compiler keep track of the destructors' addresses?
My system:
Ubuntu 12.04.
elf32-i386
Kernel: 3.2.0-30-generic-pae
gcc version: 4.6.3

DTOR_LIST is the start of a table of desctructors. Have a look which section it is in (probably .dtors):
~> objdump -t test | grep DTOR_LIST
0000000000600728 l O .dtors 0000000000000000 __DTOR_LIST__
Then dump that section with readelf (or whatever):
~> readelf --hex-dump=.dtors test
Hex dump of section '.dtors':
0x00600728 ffffffff ffffffff 1c054000 00000000 ..........#.....
0x00600738 00000000 00000000 ........
Which in my test case contains a couple of presumably -1, a real pointer, and then zero termination.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight