So I'm working with the Linux 0.11 kernel on a virtual machine, and I need to write a program that analyses executable files that are run on that kernel. The files are in the a.out format. What I want to know is, how does the operating system decide where to load the file in (virtual?) memory? Is it decided by something called "base address", and if so, how come I can't seem to find any mention of it in the a.out header?
//where is base address?
struct exec {
unsigned long a_magic; /* Use macros N_MAGIC, etc for access */
unsigned a_text; /* length of text, in bytes */
unsigned a_data; /* length of data, in bytes */
unsigned a_bss; /* length of uninitialized data area for file, in bytes */
unsigned a_syms; /* length of symbol table data in file, in bytes */
unsigned a_entry; /* start address */
unsigned a_trsize; /* length of relocation info for text, in bytes */
unsigned a_drsize; /* length of relocation info for data, in bytes */
};
I tried looking for documentation about the format, but the only information I found just explains what each of these fields is, what values a_magic can have, etc.
I need to know about it because the program needs to print out file and line numbers when given an address in memory of an instruction in the executable, and the debug symbols only have their addresses as offsets (e.g. relative to the start of text section, etc).
Also, out of curiosity, I know that in C, "(void*)0" is NULL, which you can't dereference. How then would you get the content of memory address 0?
As you see, I know very little about linux kernel and operating systems in general, so please start from the basics...
I appreciate any help you can give, thanks.
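The header has no separate base-address field. The closest thing is the a_entry field, which holds the address at which execution starts (the entry point).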
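For an analysis tool, the fields quoted in the question can be decoded from the first 32 bytes of the file. The following is a minimal sketch, assuming the 32-bit little-endian i386 layout in which every field of struct exec (including the unsigned long a_magic) occupies 4 bytes. The names aout_header, aout_parse_header and aout_is_zmagic are made up for illustration, and masking the magic to its low 16 bits follows later a.out.h headers rather than anything stated in the question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Magic number of demand-paged a.out executables (octal, as in a.out.h). */
#define AOUT_ZMAGIC 0413u

/* Fixed-width mirror of struct exec: eight 32-bit fields, 32 bytes total. */
struct aout_header {
    uint32_t a_magic, a_text, a_data, a_bss;
    uint32_t a_syms, a_entry, a_trsize, a_drsize;
};

/* Read one little-endian 32-bit value from a byte buffer. */
uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the header; returns 0 on success, -1 if the buffer is too short. */
int aout_parse_header(const unsigned char *buf, size_t len,
                      struct aout_header *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 32)
        return -1;
    out->a_magic  = read_le32(buf + 0);
    out->a_text   = read_le32(buf + 4);
    out->a_data   = read_le32(buf + 8);
    out->a_bss    = read_le32(buf + 12);
    out->a_syms   = read_le32(buf + 16);
    out->a_entry  = read_le32(buf + 20);
    out->a_trsize = read_le32(buf + 24);
    out->a_drsize = read_le32(buf + 28);
    return 0;
}

/* Non-zero if the low 16 bits of a_magic say ZMAGIC (assumed mask). */
int aout_is_zmagic(const struct aout_header *h)
{
    return (h->a_magic & 0xffffu) == AOUT_ZMAGIC;
}
```

A caller would read the first 32 bytes of the executable into a buffer, pass it to aout_parse_header, and then inspect a_entry and the section lengths.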
Also, out of curiosity, I know that in C, "(void*)0" is NULL, which you can't dereference. How then would you get the content of memory address 0?
Any system that puts memory usable by a C program at address zero would have to make it work, somehow. While one can imagine possible ways to do this, I don't know of anyone who bothers. Virtual address zero is, for all intents and purposes, never used.
The operating system can load the application at any location it chooses and then relocate the embedded addresses to be relative to that point. This relocation information is recorded in the a.out file. The base address depends on the architecture and other details and is often non-zero.
If you look at a linker map file, you should see a symbol that is either at the beginning of the memory image, or at a fixed offset from it. At runtime, subtract this value from the actual addresses you note for debugging to get to the relative address of the position you are interested in.
Note, the symbols will not be present in the executable if your linker script strips them.
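The subtraction described above can be sketched as follows. This is only an illustration: the struct text_region and the function text_offset are hypothetical names, and the base address is assumed to be known already (for example from a linker map file).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of where the text section sits at run time. */
struct text_region {
    uintptr_t base;   /* address of the first byte of text in memory */
    uintptr_t size;   /* length of the text section in bytes (a_text) */
};

/* Convert a run-time instruction address into an offset from the start of
 * the text section. Returns 0 and stores the offset on success, or -1 if
 * the address does not lie inside the region. */
int text_offset(const struct text_region *r, uintptr_t addr, uintptr_t *off)
{
    assert(r != NULL && off != NULL);
    if (addr < r->base || addr - r->base >= r->size)
        return -1;
    *off = addr - r->base;
    return 0;
}
```

The resulting offset is what would be compared against the section-relative addresses stored with the debug symbols.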
Also, out of curiosity, I know that in C, "(void*)0" is NULL, which you can't dereference. How then would you get the content of memory address 0?
Actually you can dereference NULL, but the results are not defined. For convenience, most operating systems trap the access to help you debug pointer problems.
Also, the memory location with address 0 in a process's address space is different from the memory location with address 0 in the 'hardware space'. The paging support in the CPU and operating system decouples physical memory from virtual memory. A virtual page could be mapped at address 0, although at physical address 0 you usually have interrupt vectors and other special device memory rather than ordinary RAM anyway.
I wrote a simple code trying to find out if we can read and print the memory in code segment:
#include <stdio.h>
void main() {
int *code_ptr = 0x4;
printf("code_ptr = %x\n", code_ptr);
printf("*code_ptr = %x\n", *code_ptr);
}
My system is x86_64 + Ubuntu 19.04 (Disco Dingo). And the program failed with the following output:
code_ptr = 4
Segmentation fault (core dumped)
IIUC, in Linux, the code segment and data segment share the same base address. If that's true, this program will read memory in the code segment, and I was expecting no crash, since 0x04 should be in the range of the data segment (which starts at the beginning). This should also pass the paging check, since the mapped memory for the code segment is read-only and we only read the memory here.
So did I miss anything, or are there other mechanisms that prevent us from reading from this %ds:0x4?
I think your key misunderstanding is that you're assuming the 8086 hardware feature called "the data segment" is the same as the executable image subdivision also called "the data segment." Xenix may have used that hardware feature that way, but no modern x86 Unix does. On a modern Unix, %ds:0 always points to linear address zero, not to the beginning of the executable's data segment. (And similarly %cs:0 points to linear address zero, not to the executable's text segment.)
All of an executable's segments will be loaded into linear address space somewhere well above linear address 0, and on current-generation OSes the load addresses will be randomized on each run.
There's no standard way to get a pointer to the beginning of the executable's code or data segment. On GNU systems you can use dl_iterate_phdr, and other OSes may have similar functionality under a different name.
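As a rough illustration of the dl_iterate_phdr approach on a glibc-based Linux system, the sketch below prints where each loaded object ended up. The function names print_object and list_loaded_objects are made up here. The dlpi_addr field is the offset between the addresses recorded in the ELF file and the addresses in memory, so it is typically 0 for a non-PIE executable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdio.h>

/* Callback invoked once per loaded object (the main program is visited
 * first, then shared libraries). Prints the load base and counts objects. */
static int print_object(struct dl_phdr_info *info, size_t size, void *data)
{
    (void)size;
    assert(data != NULL);
    int *count = data;
    /* The main program is usually reported with an empty name. */
    const char *name = info->dlpi_name[0] ? info->dlpi_name : "(unnamed)";
    printf("%-40s base=%#lx segments=%d\n", name,
           (unsigned long)info->dlpi_addr, (int)info->dlpi_phnum);
    ++*count;
    return 0;   /* a non-zero return value would stop the iteration */
}

/* Walk all loaded objects and return how many were visited. */
int list_loaded_objects(void)
{
    int count = 0;
    dl_iterate_phdr(print_object, &count);
    return count;
}
```

Calling list_loaded_objects() from main prints one line per object. With ASLR enabled, the base values are expected to change from run to run.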
I have a c program that looks like this
main.c
#include <stdio.h>
#define SOME_VAR 10
static int heap[SOME_VAR];
int main(void) {
printf("%p", heap);
return 0;
}
and outputs this when I run the compiled program a few times
0x58aa7c49060
0x56555644060
0x2f8d1f8e060
0x92f58280060
0x59551c53060
0xd474ed6e060
0x767c4561060
0xf515aeda060
0xbe62367e060
Why does it always end in 060? And is the array stored in heap?
Edit: I am on Linux and I have ASLR on. I compiled the program using gcc
The addresses differ because of ASLR (address space layout randomization). With it, the binary can be mapped at a different location in the virtual address space on each run.
The variable heap is, in contrast to its name, not located on the heap but in the .bss section. Its offset within the binary's image is therefore constant.
Mappings are made at page granularity, which is 4096 bytes (hex: 0x1000) on many platforms. This is why the last three hex digits of the address stay the same.
If you did the same with a stack variable, the address could vary even in the last digits on some platforms (namely Linux with recent kernels), because the stack is not only mapped somewhere else but also receives a random offset on startup.
If you are using Windows, the reason is the PE file structure.
Your heap variable is stored in the .data section of the file, and its address is calculated relative to the start of that section. Each section is loaded at its own address, but its starting address is a multiple of the page size. Because you have no other variables, heap is probably close to the start of the .data section, so the low digits of its address stay fixed.
For example, this is the table of the compiled Windows version of your code:
The .text section is where your compiled code is, and .data contains your heap variable. When your PE is loaded into memory, the sections are loaded at different addresses (as returned by VirtualAlloc()), each of which is a multiple of the page size. The address of each variable is relative to the start of its section, which is now page-aligned, so you will always see a fixed number in the lower digits. Since the offset of heap from the start of the section depends on the compiler, compile options, etc., the same code built with different compilers will show a different number, but for a given build the low digits printed are the same on every run.
When I compiled the code, I noticed that heap was placed 0x8B0 bytes after the start of the .data section. So every time I run this code, the address ends in 0x8B0.
The compiler happened to put heap at offset 0x60 bytes in a data segment it has, possibly because the compiler has some other stuff in the first 0x60 bytes, such as data used by the code that starts the main routine. That is why you see “060”; it is just where it happened to be, and there is no great significance to it.
Address space layout randomization changes the base address(es) used for various parts of program memory, but it always does so in units of 0x1000 bytes (because this avoids causing problems with alignment and other issues). So you see the addresses fluctuate by multiples of 0x1000, but the last three digits do not change.
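The arithmetic behind this can be checked with a small sketch. It assumes 4 KiB pages, and the names PAGE_BYTES, page_offset and loaded_address are illustrative only. Adding a page-aligned base to a fixed offset never changes the low 12 bits.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 0x1000u   /* 4 KiB pages, as assumed in the text */
static_assert((PAGE_BYTES & (PAGE_BYTES - 1)) == 0,
              "page size must be a power of two");

/* Offset of an address within its page: the low 12 bits. */
uint64_t page_offset(uint64_t addr)
{
    return addr & (PAGE_BYTES - 1);
}

/* Address of a variable once its image is loaded at a page-aligned base. */
uint64_t loaded_address(uint64_t page_aligned_base, uint64_t offset_in_image)
{
    assert(page_offset(page_aligned_base) == 0);
    return page_aligned_base + offset_in_image;
}
```

Applied to the addresses in the question, page_offset returns 0x060 for each of them, whatever base the randomization picked.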
The definition static int heap[SOME_VAR]; defines heap with static storage duration. Typical C implementations store it in a general data section, not in the heap. The “heap” is a misnomer for memory that is used for dynamic allocation. (It is a misnomer because malloc implementations may use a variety of data structures and algorithms, not limited to heaps. They may even use multiple methods in one implementation.)
I am looking into the memory layout of a given process. I notice that the starting memory location of each process is not 0. On this website, TEXT starts at 0x08048000. One reason can be to distinguish the address from the NULL pointer. I am just wondering if there are any other good reasons? Thanks.
The null pointer doesn't actually have to be represented by address 0. The C standard only guarantees that an integer constant 0 used in a pointer context is treated as the null pointer by the compiler.
But the 0 that you use in your source code is just syntactic sugar that has no relation to the actual physical address the null-pointer value is "pointing" to.
For further details see:
Why is NULL/0 an illegal memory location for an object?
Why is address zero used for the null pointer?
An application on your operating system has its own address space, which it sees as a contiguous block of memory (the memory isn't physically contiguous; that is just the impression the operating system gives to every program).
For the most part, each process's virtual memory space is laid out in a similar and predictable manner (this is the memory layout in a Linux process, 32-bit mode):
(image from Anatomy of a Program in Memory)
Look at the text segment (the default .text base on x86 is 0x08048000, chosen by the default linker script for static binding).
Why the magical 0x08048000? Likely because Linux borrowed that address from the System V i386 ABI.
... and why then did System V use 0x08048000?
The value was chosen to accommodate the stack below the .text section, growing downward. The 0x48000 bytes could be mapped by the same page table already required by the .text section (thus saving a page table in most cases), while the remaining 0x08000000 would allow more room for stack-hungry applications.
Is there anything below 0x08048000? There could be nothing (it's only 128M), but you can pretty much map anything you desire there, using the mmap() system call.
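The page-table remark in the quote can be made concrete with a little arithmetic. The sketch below assumes classic i386 paging without PAE, where one page table holds 1024 entries of 4 KiB each and so covers 4 MiB. The names PAGE_TABLE_SPAN, page_table_index and offset_in_span are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* One i386 page table (no PAE) maps 1024 pages of 4 KiB each. */
#define PAGE_TABLE_SPAN 0x400000u
static_assert(PAGE_TABLE_SPAN == 1024u * 4096u, "1024 pages of 4 KiB");

/* Index of the page table that maps addr (top 10 bits of the address). */
uint32_t page_table_index(uint32_t addr)
{
    return addr / PAGE_TABLE_SPAN;
}

/* Bytes between the start of that page table's 4 MiB span and addr. */
uint32_t offset_in_span(uint32_t addr)
{
    return addr % PAGE_TABLE_SPAN;
}
```

Both 0x08000000 and 0x08048000 yield page table index 32, and offset_in_span(0x08048000) is 0x48000. Under these assumptions, that is the room a downward-growing stack could use below .text without needing a second page table.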
See also:
What's the memory before 0x08048000 used for in 32 bit machine?
Reorganizing the address space
mmap
I think this sums it up:
Each process has its own set of page tables, but there is a catch. Once virtual addresses are enabled, they apply to all software running in the machine, including the kernel itself. Thus a portion of the virtual address space must be reserved to the kernel.
So while each process gets its own address space, a block of it has to be allocated to the kernel; without that, the process would not be able to address kernel code and data.
This block always appears first in memory and so includes address 0. The user mode space starts beyond this, and so that is where both the stack and heap reside.
Distinguishing from NULL pointer
Even if the user mode space started at address 0, there would not be any data allocated to the address 0 as that will be in the stack or the heap which themselves do not start at the beginning of the user area. Therefore NULL (with the value of 0) could be used still and is not a reason for this layout.
However one benefit related to the NULL and the first block being kernel memory is any attempt to read/write to NULL throws a Segmentation Fault.
A loader loads a binary into memory in segments: text (code and constants) and data. There is no need to start from 0, and since C programs are prone to bugs that access memory around null, as in a[i] with a null a, starting at 0 would even be dangerous. Leaving that region unused allows such accesses to be intercepted (on some processors) as segmentation faults.
A linear address space starting from 0 would have to be introduced by the C runtime. That might be imaginable where C is the operating system's implementation language, but having the heap start from 0 serves no purpose. The memory model is one of segments, and a code segment might be protected against modification by some processors.
Within those segments, allocation happens in memory blocks managed by the C runtime.
I might add that physical address 0 and upwards is often used by the operating system itself.
This was a question asked by an interviewer:
#include<stdio.h>
int main()
{
char *c="123456";
printf("%d\n",c);
return 0;
}
This piece of code always prints a fixed number (e.g. 13451392), no matter how many times you execute it. Why?
Your code contains undefined behavior: printing a pointer needs to be done using %p format specifier, and only after converting it to void*:
printf("%p\n", (void*)c);
This would produce a system-dependent number, which may or may not be the same on different platforms.
The reason that it is fixed on your platform is probably that the operating system always loads your executable into the same spot of virtual memory (which may be mapped to different areas of physical memory, but your program would never know). String literal, which is part of the executable, would end up in the same spot as well, so the printout would be the same all the time.
To answer your question, the character string "123456" is a static constant in memory, and when the .exe is loaded, it always goes into the same memory location.
What c is (or rather what it contains) is the memory address of that character string which, as I said, is always at the same location. If you print the address as a decimal number, you see the address, in decimal.
Of course, as @dasblinkenlight said, you should print it as a pointer, because different machines/languages have different conventions about the size of pointers versus the size of ints.
Most executable file formats have a way to tell the OS loader at which virtual address to load the executable. For example, the PE format used by Windows has the ImageBase field for this, which is usually set to 0x00400000 for applications.
When the loader first loads the executable, it tries to load it at that address. If the address is not in use, which is usually the case, the image is loaded there; if it is in use, the image is loaded at a different address chosen by the system.
The case here is that the offset of your "123456" in the data section is the same, and the OS loads the image at the same base address, so you always get the same virtual address, base + offset.
But this is not always the case. One reason, given above, is that the base address may already be in use: a lot of Windows DLLs compiled with MSVC set their base address to 0x10000000, so at most one of them is actually loaded at that address.
Another case is address space layout randomization (ASLR), a security feature. If it is supported and enabled by the system (MSVC has the linker option /DYNAMICBASE), the system will ignore the specified image base and pick a random address of its own.
Two things to conclude:
You should not depend on this behavior: the system can load your program at any address, and thus you will get a different address.
Use %p for printing addresses. On some systems, for example, int is 4 bytes and pointers are 8 bytes, so part of your address will be chopped off.
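The chopping mentioned in the second point can be illustrated with a short sketch. It assumes a platform with 64-bit pointers, and the helper names truncate_to_32_bits and format_pointer are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* What survives when a 64-bit address is squeezed into 32 bits:
 * only the low 32 bits; the upper half is silently dropped. */
uint32_t truncate_to_32_bits(uint64_t addr)
{
    return (uint32_t)addr;
}

/* Portable way to print an address: %p with a void * argument.
 * Writes into buf and returns the number of characters produced. */
int format_pointer(char *buf, size_t len, void *p)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%p", p);
}
```

For instance, an address such as 0x58aa7c49060 would come out as 0xa7c49060 after truncation, which no longer identifies the original location.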
So I'm looking through my C programming text book and I see this code.
#include <stdio.h>
int j, k;
int *ptr;
int main(void)
{
j = 1;
k = 2;
ptr = &k;
printf("\n");
printf("j has the value %d and is stored at %p\n", j, (void *)&j);
printf("k has the value %d and is stored at %p\n", k, (void *)&k);
printf("ptr has the value %p and is stored at %p\n", (void *)ptr, (void *)&ptr);
printf("The value of the integer pointed to by ptr is %d\n", *ptr);
return 0;
}
I ran it and the output was:
j has the value 1 and is stored at 0x4030e0
k has the value 2 and is stored at 0x403100
ptr has the value 0x403100 and is stored at 0x4030f0
The value of the integer pointed to by ptr is 2
My question is, if I had not run this through a compiler, how would I know the addresses of those variables just by looking at this code? I'm just not sure how to get the actual address of a variable. Thanks!
Here's my understanding of it:
The absolute addresses of things in memory in C are unspecified; they are not standardised in the language. Because of this, you can't know the locations of things in memory by looking at just the code. (However, if you use the same compiler, code, compiler options, runtime and operating system, the addresses may be consistent.)
When you're developing applications, this is not behaviour you should rely on. You may rely on the difference between the locations of two things in some contexts, however. For example, you can determine the difference between the addresses of pointers to two array elements to determine how many elements apart they are.
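The array case mentioned above can be sketched as follows. The function name elements_apart is illustrative; the subtraction is only defined when both pointers refer to elements of the same array (or one past its end).

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements from `from` to `to`. Both must point into the same
 * array (or one past its end); otherwise the subtraction is undefined. */
ptrdiff_t elements_apart(const int *from, const int *to)
{
    assert(from != NULL && to != NULL);
    return to - from;
}
```

The result counts elements, not bytes, so it does not depend on where the array happens to live in memory.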
By the way, if you are considering using the memory locations of variables to solve a particular problem, you may find it helpful to post a separate question asking how to do so without relying on this behaviour.
There is no other way to "know the exact address" of a variable in Standard C than to print it with "%p". The actual address is determined by many factors not under control of the programmer writing code. It's a matter of OS, the linker, the compiler, options used and probably others.
That said, in the embedded systems world, there are ways to express this variable must reside at this address, for example if registers of external devices are mapped into the address space of a running program. This usually happens in what is called a linker file or map file or by assigning an integral value to a pointer (with a cast). All of these methods are non-standard.
For the purpose of your everyday garden-variety programs though, the point of writing C programs is that you need not and should not care where your variables are stored.
You can't.
Different compilers can put the variables in different places. On some machines the address is not a simple integer anyway.
The compiler only knows things like "the third integer global variable" and "the four bytes allocated 36 bytes down from the stack pointer." It refers to global vars, pointers to subroutines (functions), subroutine arguments and local vars only in relative terms. (Never mind the extra stuff for polymorphic objects in C++, yikes!) These relative references are saved in the object file (.o or .obj) as special codes and offset values.
The Linker can fill in some details. It may modify some of these sketchy location references when joining several object files. Global variable locations will share a space (the Data Section) when globals from multiple compilation units are merged; the linker decides what order they all go in, but still describing them as relative to the start of the entire set of global vars. The result is an executable file with the final opcodes, but addresses still being sketchy and based on relative offsets.
It's not until the executable is loaded that the Loader replaces all the relative addresses with actual addresses. This is possible now, because the loader (or some part of the operating system it depends on) decides where in the whole virtual address space of the process to store the program's opcodes (Text Section), global variables (BSS, Data Sections) and call stack, and other things. The loader can do the math, and write the actual address into every spot in the executable, typically as part of "load immediate" opcodes and all opcodes involving memory access.
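A heavily simplified model of that last step might look like the sketch below. It assumes the image was linked as if loaded at address 0, that every relocation entry is just a byte offset of a 32-bit word holding an address, and that the words are in host byte order. Real relocation formats carry more information, and apply_relocations is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add load_base to each 32-bit word named by the relocation table.
 * image:  the program bytes as read from the file (linked at address 0)
 * relocs: byte offsets into image of the words that hold addresses
 * Returns 0 on success, -1 as soon as an offset falls outside the image. */
int apply_relocations(unsigned char *image, size_t image_len,
                      const uint32_t *relocs, size_t nrelocs,
                      uint32_t load_base)
{
    assert(image != NULL && (relocs != NULL || nrelocs == 0));
    for (size_t i = 0; i < nrelocs; i++) {
        uint32_t word;
        if (image_len < sizeof word || relocs[i] > image_len - sizeof word)
            return -1;
        memcpy(&word, image + relocs[i], sizeof word);  /* unaligned-safe */
        word += load_base;
        memcpy(image + relocs[i], &word, sizeof word);
    }
    return 0;
}
```

Words that are not listed in the table are left untouched, which is why the table has to record every place where an address is embedded.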
Google "relocation table" for more. See http://www.iecc.com/linker/linker07.html (somewhat old) for a more detailed explanation for particular platforms.
In real life, it's all complicated by the fact that virtual addresses are mapped to physical addresses by a virtual memory system, using segments or some other mechanism to keep each process in a separate address space.
I would like to build upon the answers already provided by pointing out that some toolchains, such as Visual Studio's, support Address Space Layout Randomization (ASLR), an operating-system security feature that makes programs begin at a random memory address. Given the addresses that you have in your output, I'd say that you compiled without it (programs without it start at address 0x400000, I think). My source for this information is an answer to this question.
That said, the compiler is what determines the memory addresses at which local variables will be stored. The addresses will most likely change from compiler to compiler, and probably also with each version of the source code.
Every process has its own logical address space starting from zero. The addresses your program can access are all relative to zero. The absolute address of any memory location is decided only after the process has been loaded into main memory. Modern operating systems do this using dynamic relocation. Hence, every time a process is loaded into memory, it may be loaded at a different location according to the availability of memory, and letting user processes know the exact address of data stored in memory does not make any sense. What your code is printing is a logical address and not the exact or physical address.
Continuing on the answers described above, please do not forget that processes would run in their own virtual address space (process isolation). This ensures that when your program corrupts some memory, the other running processes are not affected.
Process Isolation:
http://en.wikipedia.org/wiki/Process_isolation
Inter-Process Communication
http://en.wikipedia.org/wiki/Inter-process_communication