Jump to a specific address in a shared object file (loaded via LD_PRELOAD) - ARM

Suppose we have two programs. The first, test, is in binary form (an ELF file), and the second is a shared object file, say f.so.
I want to hook some instruction in the program test and move execution to a particular instruction located in the f.so file. I do not want to make a function call. For example, I want to hook a (binary) instruction in the test program, remove its 4 bytes (ARM 32-bit arch), and write a new branch instruction that points to an instruction located in the f.so shared object.
Is that possible using LD_PRELOAD?
Edit 1
Based on the helpful information provided (thanks in advance), I performed some experiments, and I will explain them in some detail (sorry for providing pictures rather than text) ...
Before instrumenting the target binary file, i.e., test (a task that may not be easy, particularly when we want to overwrite some bytes with a new branch instruction), I performed some experiments to hook the control flow of the test program and see whether it is possible to move execution to some place in the f.so file. To this end, and for simplicity, I used gdb, since it makes it easy to modify the program counter register to point somewhere else.
In the following, I performed my tests on binaries compiled for the x86 architecture rather than ARM; however, the same task can be ported to ARM binaries.
The following image shows the core dump (disassembly) of (on left) shared object file f.so and (on right side) the target program, test.
My first question is: why is there a difference in the memory addresses at which the program test and f.so were loaded?
Then I moved to the GDB environment to patch a particular instruction in the test binary and move control (execution) to an arbitrary location in f.so. I used the following commands, and ran the program to ensure it executes correctly with LD_PRELOAD inside GDB:
(gdb) set environment LD_PRELOAD=./f.so
(gdb) file ./test
Reading symbols from ./test...(no debugging symbols found)...done.
(gdb) r
Starting program: /home/null/Desktop/Edit_elf/test
hello
19
[Inferior 1 (process 5827) exited normally]
(gdb)
OK, now I set a breakpoint at the main function in the test binary to check whether our f.so is really loaded. After the breakpoint was hit, I disassembled the function func1 in f.so ... it looks like below:
As we can see, f.so is loaded at some place in memory (I really don't know where) ... these addresses are the same as those presented in image 1, with the prefix b7fd2 (for example, in image 1 the address of the first instruction in f.so is 520, and the address of the same instruction in image 2 is b7fd2520). I don't know precisely the reason for that, but I think it is due to virtual memory mapping and related things.
I listed the register values of our test program (stopped at the predefined breakpoint) and changed the value of the program counter (eip on x86) to move execution after the breakpoint to some place in f.so.
Then I let the test program continue its execution. It is now expected that the CPU will move execution to the place eip points to, i.e., 0xb7fd2526.
Wow ... as we can see from the above image, we are able to move execution from test to some place in a shared library loaded by LD_PRELOAD. However...
Why is there a difference in the address maps of test and f.so? My final goal is to patch the test binary: remove some bytes and overwrite them with new machine code that injects a branch instruction pointing into the shared object file. Therefore, I need to calculate the target address of the branch instruction correctly in order to embed it in the written bytes, and AFAIK that target is encoded relative to the address of the current instruction plus some constant. So I feel it is hard to create the machine code for a branch instruction that moves control from one memory region (like 0x804843b) to another (0xb7fd2526) ...

is that possible using LD_PRELOAD?
Sure: the LD_PRELOADed library will have its initializers run at load time.
One of these initializers can binary-patch instructions inside the main executable (by doing mprotect(..., PROT_WRITE) + overwrite instruction bytes + mprotect(..., PROT_EXEC)).
If the main binary is non-PIE, you could hard-code the address to patch directly into the library.
If it is a PIE binary, you'll have to find where in memory it was loaded and adjust the offset accordingly.

Related

How are code segments and data segments of a source code program really handled and separated from each other during process execution?

Consider the following picture, showing a RAM in which a very simple program is stored, divided into an instruction block and a data block. The example is very similar to the ones found in the book "Code" by Charles Petzold:
As you can see, there's an instruction block and a data block. In the book, this RAM is put inside a rudimentary computer into which you have to enter both data and instructions manually using switches (just like the old Altair 8800). For the machine to start executing instructions, you had to set the initial address of the instruction block, and then the machine started executing one instruction after another sequentially. Basically, all this program does is load the value 1 into the accumulator, add 5 to it, store the result at address 000Ch (h stands for hex), and finally stop executing with a Halt instruction.
Now, when I try to connect the knowledge I got from this book to the way C source code is compiled, I get a bit confused. Specifically, the phase in which there is some separation between the code segment and the data segment. Consider this simple source code:
#include <stdlib.h>
#include <stdio.h>

int test = 10;

int main() {
    test++;
    return 0;
}
Now my idea is that the compiler should tell the computer to execute machine instructions like this:
int test=10; -> STORE [addressX],10
int main()
{
test++; -> LOAD A,[addressX]
-> INR A
-> STORE [addressX],A
return 0;
}
According to the definition of Wikipedia the data segment "contains initialized static variables, i.e. global variables and local static variables which have a defined value and can be modified".
In my simple example the variable test is a global variable.
However, my idea is that before the variable is put inside the data segment in RAM, some machine instruction like STORE must have been invoked. Otherwise, how can the global variable be stored in RAM?
Can someone explain in detail what is really happening, and how the simple source code I showed here is divided into a text segment and a data segment? What exactly is the text segment for this example? And what about the data segment?
I hope you understand what my doubt is and can answer as clearly as possible. I would also appreciate it if you could point me to some good, in-depth resources (with examples, not abstract ones) for understanding what's really going on when dealing with the code, data, stack, and heap segments.
I can explain how things work roughly on Windows.
First of all, the information given in the book does not apply that much to modern OSes. Most OSes (such as Windows, Linux, etc.) have an executable file format that describes how the code and data are stored within the file, how they are mapped into RAM, where execution of the code starts, and so on. On Windows, the format is called Portable Executable (PE). A PE file consists of zero or more sections that store the code and other data. Sections carry important information such as where the OS will find the section's data in the file, how to map that data into memory, and what kind of memory protection will be applied to it. Sections can also have names such as .text, .data, .bss, .idata, .rdata, giving a clue about the kind of data they contain.
When you compile and link your code with MSVC on Windows, you get a portable executable file for your program. This PE file will have one or more sections. For your example, it may have a .text section for code, a .data section for initialized data, and an .idata section for your imports from other modules. The .text section holds the compiled machine code; the .data section holds the value 10 for the variable test. When you execute the file, the OS loader will load, parse, and map it into the memory created for its process.
So, you don't need a STORE instruction to store and initialize the data in RAM. All the data in your program is located in the corresponding section and will be mapped into memory by the loader.

Receiving Segmentation fault when trying to execute injected code inside ELF binary

I am currently working on an ELF injector, and my approach is standard: find a code cave (a long enough sequence of 0s), rewrite it with the instructions I want to execute, and then jump back to the start of the original program so it executes as it normally would.
To actually execute code in the code cave, I tried two different approaches, both of which result in SIGSEGV.
The first was changing the entry point to the start of the code cave. The second was "stealing" some of the first instructions from the original code, writing a jump to my code cave there, and then, after executing my injected code, first executing the stolen instructions and then jumping to the instruction after the last stolen one in the original program.
I am also changing the access flags for the section in which the code cave resides.
Here are some screenshots of debugging the program in gdb:
And here are the flags for the section the code cave is in:
[19] 0x555555556058->0x555555556160 at 0x00002058: .eh_frame ALLOC LOAD READONLY CODE HAS_CONTENTS
EDIT: This is the Valgrind output, so the problem is actually with the permissions. How can I allow the code inside this section to be executed?
I am also changing the access flags for the section in which the code cave resides.
Sections are not used at runtime, only segments are. Changing access flags on a section after the program is linked does exactly nothing.
You need to find a place for your cave in a segment with the right permissions.
P.S. You appear to be using objdump to examine contents of your ELF file.
Don't: it's entirely inadequate. Use readelf instead.

Linux: using backtrace(), /proc/self/maps and addr2line together results in invalid result

I'm trying to implement a way to record call stacks of my program into a file and then display them later.
Here are the steps:
Write the content of /proc/self/maps to a log file.
In this example, the content of /proc/self/maps is:
00400000-05cdc000 r-xp 00000000 00:51 12974779926 helloworld
Which means the base address of helloworld program is 0x400000.
In the program, whenever an interesting code needs to have its callstack recorded, I use the function backtrace() to obtain the callstack's addresses then write to the log file. Let say the callstack in this example is:
0x400001
0x400003
At some point later, in a separate log-viewer program, the log file is opened and parsed. The program's base address is subtracted from each address in the call stack. In this case:
0x400001 - 0x400000 = 1
I then use this offset to obtain the line number using the addr2line program:
addr2line -fCe helloworld 0x1
However, this produces a ??? result, i.e., an invalid offset.
But if I don't subtract anything from the call stack's address and instead pass the actual value to the addr2line command:
addr2line -fCe helloworld 0x400001, then it returns the correct file and line number.
The thing is, if the address is within a shared object, then an absolute address won't work while a subtracted offset will.
Why is there such a difference in the way the addresses are mapped for the main executable and the shared objects? Or maybe this is backtrace implementation specific, such that it always returns an absolute address for a function within the main executable?
Why is there such a difference in the way the addresses are mapped for the main executable and the shared objects?
The shared libraries are usually linked at address 0 and relocated. A non-position-independent executable is usually linked at address 0x400000 on x86_64 Linux and must not be relocated (or it wouldn't work).
To find out where a given ELF binary is linked, look at the p_vaddr address of the first PT_LOAD segment (readelf -Wl foo will show you that). In addition, only ET_DYN ELF binaries can be relocated, while ET_EXEC binaries must not be.
Note that position-independent executables exist, and for them you need to do the subtraction.
Note that shared libraries are usually linked at address 0 (and so subtraction works), but they don't have to. Running prelink on a shared library will result in a shared library linked at non-0 address, and then the subtraction you use will not work either.
Really, what you need to do is subtract at-runtime load address from linked-at address to get relocation (which would be 0 for non-PIE executables, and non-0 for shared libraries), and then subtract that relocation from the program counter recorded by backtrace to get the symbol value.
Finally, if you iterate over all loaded ELF images with dl_iterate_phdr, the dlpi_addr it provides is exactly the relocation that you need to subtract.

8085 Microprocessor: How to see the changes your program made to memory

I want to write an assembler for the 8085 in C. I used GNUSIM8085 to review my knowledge of assembly.
I learned assembly in my microprocessor class, where I used ASMIDE with an HCS12 Dragon board. With ASMIDE and the Dragon board, I used some instructions (I forget what they were) to display the data in different memory locations both before and after running the program, and also an instruction to load and run the program.
It was something like this:
// Load assembly program
// Check memory values of A1H - A9H (for example)
// Run program (that modifies those memory locations)
// Check memory values of A1H - A9H
I forget exactly what the instructions were, but I want to know what the equivalent instructions are with the 8085. In GNUSIM8085 I can see the changes that have been made to memory in a GUI, like this:
I want my assembler to be purely a command-line application, so I want something similar to ASMIDE. I can't find instructions for loading and reading data from memory, or for running a program, in any instruction set.
I'm starting to think that it doesn't really have anything to do with the microprocessor itself, and that the instructions I used in my microprocessor class were specific to ASMIDE.
In that case, should I make up my own instructions for reading data, loading a program, etc.?

How does OS execute binary files in virtual memory?

For example, in my program I called a function foo(). The compiler and assembler would eventually write jmp someaddr in the binary. I know the concept of virtual memory. The program thinks it has the whole memory at its disposal, and the start position is 0x000. In this way, the assembler can calculate the position of foo().
But in fact this is not decided until runtime, right? I have to run the program to know where it was loaded, and hence the address of the jmp. But when the program actually runs, how does the OS come in and change the address of the jmp? These are direct CPU instructions, right?
This question can't be answered in general because it's totally hardware and OS dependent. However a typical answer is that the initially loaded program can be compiled as you say: Because the VM hardware gives each program its own address space, all addresses can be fixed when the program is linked. No recalculation of addresses at load time is needed.
Things get much more interesting with dynamically loaded libraries, because two libraries used by the same initially loaded program might be compiled with the same base address, so their address spaces would overlap.
One approach to this problem is to require Position Independent Code in DLLs. In such code, all addresses are relative to the code itself. Jumps are usually relative to the PC (though a code segment register can also be used). Data are likewise relative to some data segment or base register. To choose the runtime location, the PIC code itself needs no change; only the segment or base register(s) need to be set, in the prelude of every DLL routine.
PIC tends to be a bit slower than position dependent code because there's additional address arithmetic and the PC and/or base registers can bottleneck the processor's instruction pipeline.
So the other approach is for the loader to rebase the DLL code when necessary to eliminate address-space overlaps. For this, the DLL must include a table of all the absolute addresses in the code. The loader computes an offset between the assumed code and data base addresses and the actual ones, then traverses the table, adding the offset to each absolute address as the program is copied into VM.
DLLs also have a table of entry points so that the calling program knows where the library procedures start. These must be adjusted as well.
Rebasing is not great for performance either. It slows down loading. Moreover, it defeats sharing of DLL code. You need at least one copy per rebase offset.
For these reasons, DLLs that are part of Windows are deliberately compiled with non-overlapping VM address spaces. This speeds loading and allows sharing. If you ever notice that a 3rd party DLL crunches the disk and loads slowly, while MS DLLs like the C runtime library load quickly, you are seeing the effects of rebasing in Windows.
You can infer more about this topic by reading about object file formats. Here is one example.
Position-independent code is code that you can run from any address. If you have a jmp instruction in position-independent code, it will often be a relative jump, which jumps to an offset from the current location. When you copy the code, it won't change the offsets between parts of the code so it will still work.
Relocatable code is code that you can run from any address, but you might have to modify the code first (maybe you can't just copy it). The code will contain a relocation table which tells how it needs to be modified.
Non-relocatable code is code that must be loaded at a certain address or it will not work.
Each program is different, it depends on how the program was written, or the compiler settings, or other various factors.
Shared libraries are usually compiled as position-independent code, which allows the same library to be loaded at different locations in different processes, without having to load multiple copies into memory. The same copy can be shared between processes, even though it is at a different address in each process.
Executables are often non-relocatable, but they can be position-independent. Virtual memory allows each program to have the entire address space (minus some overhead) to itself, so each executable can choose the address at which it's loaded without worrying about collisions with other executables. Some executables are position-independent, which can be used to increase security (ASLR).
Object files and static libraries are usually relocatable code. The linker will relocate them when combining them to create a shared library, executable, or other image.
Boot loaders and operating system kernels are almost always non-relocatable.
Yes, it is decided at runtime. The operating system (the part managing starting and switching tasks) is ideally at a different protection level; it has more power. It knows what memory is in use and allocates some for the new task. It configures the MMU so that the new task has a virtual address space starting at zero, or whatever the rule is for that operating system and processor. How you get into user mode at that starting address is very processor-specific.
One method, for example: when an interrupt occurs, the hardware might save some state (not just the address but also the mode or virtual ID), let's say on the stack. The return-from-interrupt instruction, as defined by that processor, takes the address and state/mode off the stack and switches there (causing, let's assume, the MMU to react to its next fetch based on the new mode, not the old). For a processor that works like that, you may be able to fake an interrupt return by placing the right items on the stack, so that when you kick off the interrupt-return instruction it basically does a jump with the additional features of switching mode, etc.
The ARM family, for example (not Cortex-M), has a processor state register for what you are running now (in the case of an interrupt or service call) and a second state register for where you came from, i.e., the state that was interrupted. When you do the proper return, you give it the address, and it switches back to that mode using the other register. You can directly access that register from the non-user modes, so you can manipulate the state of the return. There is no return instruction in ARM, just flavors of jump (modifications to the program counter), so it is a special kind of jump.
The short answer is that it is very specific to the processor what your choices are for jumping into a task the first time, or returning to a running task after a task switch, in an application mode in a virtual address space. Either directly or indirectly, the processor documentation will describe these modes and how to change them. If it is not explicitly described, then you have to figure out on your own, from the instructions and the MMU protections, how to switch tasks.
