Code generation for expressions with fixed/preassigned register - c

I'm using the algorithm below (the idea is taken from this answer) to generate code from a tree. I'm targeting the x86 architecture, and now I need to deal with mul/div instructions, which use the registers eax/ebx as arguments.
My question is:
How do I modify this so that the operands of a certain instruction are loaded into fixed registers? Say, for a mul instruction, load the left and right subtrees into the eax and ebx registers. My current implementation is: pass the node currently being evaluated as an argument and, if it's MUL or DIV, set reg to R0 or R1 according to the tree's side (LEFT or RIGHT respectively). If reg is in use, push reg on the stack and mark it as free (not implemented yet). The current implementation doesn't work because it trips assert(r1 != r2) in the emit_load() function (meaning both registers passed as arguments are equal, e.g. r1 = REG_R0 and r2 = REG_R0).
void gen(AST *ast, RegSet in_use, AST *root) {
if(ast->left != 0 && ast->right != 0) {
Reg spill = NoRegister; /* no spill yet */
AST *do1st, *do2nd; /* what order to generate children */
if (ast->left->n >= ast->right->n) {
do1st = ast->left;
do2nd = ast->right;
} else {
do1st = ast->right;
do2nd = ast->left; }
gen(do1st, in_use);
in_use |= 1 << do1st->reg;
if (all_used(in_use)) {
spill = pick_register_other_than(do1st->reg);
in_use &= ~(1 << spill);
emit_operation(PUSH, spill);
}
gen(do2nd, in_use);
ast->reg = ast->left->reg;
emit_operation(ast->type, ast->left->reg, ast->right->reg);
if (spill != NoRegister)
emit_operation(POP, spill);
} else if(ast.type == Type_id || ast.type == Type_number) {
if(node->type == MUL || node->type == DIV) {
REG reg;
if(node_side == ASTSIDE_LEFT) reg = REG_R0;
if(node_side == ASTSIDE_RIGHT) reg = REG_R1;
if(is_reg_in_use(in_use, reg)) {
emit_operation(PUSH, reg);
}
} else {
ast->reg = pick_unused_register(in_use);
emit_load(ast);
}
} else {
print("gen() error");
// error
}
}
// ershov numbers
void label(AST ast) {
if(ast == null)
return;
label(ast.left);
label(ast.right);
if(ast.type == Type_id || ast.type == Type_number)
ast.n = 1;
// ast has two childrens
else if(ast.left not null && ast.right not null) {
int l = ast.left.n;
int r = ast.right.n;
if(l == r)
ast.n = 1 + l;
else
ast.n = max(1, l, r);
}
// ast has one child
else if(ast.left not null && ast.right is null)
ast.n = ast.left.n;
else
print("label() error!");
}

With a one-pass code generator like this, your options are limited. It's probably simpler to generate 3-address code or some other linear intermediate representation first and then worry about register targeting (this is the name for what you're trying to accomplish).
Nonetheless, what you want to do is possible. The caveat is that you won't get very high quality code. To make it better, you'll have to throw away this generator and start over.
The main problem you're experiencing is that Sethi-Ullman labeling is not a code generation algorithm. It's just a way of choosing the order of code generation. You're still missing important ideas.
With all that out of the way, some points:
Pushing and popping registers to save them temporarily makes life difficult. The reason is pretty obvious. You can only get access to the saved values in LIFO order.
Things become easier if you allocate "places" that may be either registers or memory locations in the stack frame. The memory locations effectively extend the register file to make it as large as necessary. A slight complication is that you'll need to remember for each function how many words are required for places in that function's stack frame and backpatch the function preamble to allocate that number.
Next, implement a global operand stack where each stack element is a PLACE. A PLACE is a descriptor telling where an operand that has been computed by already-emitted code is located: register or memory and how to access it. (For better code, you can also allow a PLACE to be a user variable and/or immediate value, but such PLACEs are never returned by the PLACE allocator described below. Additionally, the more kinds of PLACEs you allow, the more cases must be handled by the code emitter, also described below.)
The general principle is "be lazy." The later we can wait to emit code, the more information will be available. With more information, it's possible to generate better code. The stack of PLACEs does a reasonably good job of accomplishing this.
The code generator invariant is that it emits code that leaves the result PLACE at the top of the operand stack.
You will also need a PLACE allocator. This keeps track of registers and the memory words in use. It allocates new memory words if all registers and current words are already busy.
Registers in the PLACE allocator can have three possible statuses: FREE, BUSY, PINNED. A PINNED register is one needed to hold a value that can't be moved. (We'll use this for instructions with specific register requirements.) A BUSY register is one needed for a value that's okay to be moved to a different PLACE as required. A FREE register holds no value.
Memory in the PLACE allocator is either FREE or BUSY.
The PLACE allocator needs at least these entry points:
allocate_register: picks a FREE register R, makes it BUSY, and returns R. If no FREE registers are available, it allocates a FREE memory word P, emits code that moves a BUSY register R's contents there, and returns R.
pin_register(R): if R is PINNED, raises a fatal error. If R is BUSY, gets a FREE PLACE P (either register or memory word), emits code that moves the contents of R to P, marks R PINNED and returns. If R is FREE, just marks it PINNED and returns.
Note that when pinning or allocating register R requires moving its contents, the allocator must update the corresponding element in the operand stack. What was R must be changed to P. For this purpose, the allocator maintains a map taking each register to the operand stack PLACE that describes it.
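The allocator described above can be sketched in C roughly as follows. This is only a minimal illustration under my own assumptions: a toy file of four registers, a fixed pool of spill words, and the names Place, reg_owner, find_register and spill_to_memory are all hypothetical. I have also given allocate_register an owner parameter so the register-to-PLACE map can be kept up to date, and a fatal error is modeled as a -1 return.

```c
#include <stdio.h>

enum { NREGS = 4, NWORDS = 16 };
typedef enum { FREE, BUSY, PINNED } Status;
typedef enum { REGISTER, MEMORY } Kind;

/* A PLACE records where an already-computed operand lives. */
typedef struct { Kind kind; int index; } Place;

static const char *reg_name[NREGS] = { "eax", "ebx", "ecx", "edx" };
static Status reg_status[NREGS];   /* all FREE initially */
static Place *reg_owner[NREGS];    /* operand-stack entry describing each register */
static Status mem_status[NWORDS];  /* FREE or BUSY only */
static int frame_words;            /* high-water mark, for backpatching the preamble */

static int find_register(Status wanted) {
    for (int r = 0; r < NREGS; r++)
        if (reg_status[r] == wanted) return r;
    return -1;
}

/* Spill register r to a fresh memory word and retarget its owner's PLACE. */
static int spill_to_memory(int r) {
    for (int w = 0; w < NWORDS; w++) {
        if (mem_status[w] != FREE) continue;
        mem_status[w] = BUSY;
        if (w + 1 > frame_words) frame_words = w + 1;
        printf("    mov [ebp - %d], %s\n", 4 * (w + 1), reg_name[r]);
        if (reg_owner[r]) { reg_owner[r]->kind = MEMORY; reg_owner[r]->index = w; }
        reg_owner[r] = NULL;
        return 0;
    }
    return -1;  /* this toy model ran out of spill slots */
}

/* Returns a BUSY register now described by *owner, or -1 on failure. */
int allocate_register(Place *owner) {
    int r = find_register(FREE);
    if (r < 0) {
        r = find_register(BUSY);
        if (r < 0 || spill_to_memory(r) < 0) return -1;
    }
    reg_status[r] = BUSY;
    reg_owner[r] = owner;
    owner->kind = REGISTER;
    owner->index = r;
    return r;
}

/* Makes register r PINNED, relocating its value first if it was BUSY. */
int pin_register(int r) {
    if (reg_status[r] == PINNED) return -1;   /* the "fatal error" case */
    if (reg_status[r] == BUSY) {
        int dst = find_register(FREE);
        if (dst >= 0) {
            printf("    mov %s, %s\n", reg_name[dst], reg_name[r]);
            reg_status[dst] = BUSY;
            reg_owner[dst] = reg_owner[r];
            if (reg_owner[dst]) reg_owner[dst]->index = dst;
            reg_owner[r] = NULL;
        } else if (spill_to_memory(r) < 0) {
            return -1;
        }
    }
    reg_status[r] = PINNED;
    return 0;
}
```

In this sketch, allocating a fifth register when all four are BUSY spills eax to the first frame word and rewrites the spilled operand's Place to MEMORY, which is the "update the corresponding element in the operand stack" step.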
With all this complete, the code generator for binary ops will be simple:
gen_code_for(ast_node) {
if (ast_node->left_first) {
gen_code_for(ast_node->left_operand)
gen_code_for(ast_node->right_operand)
} else {
gen_code_for(ast_node->right_operand)
gen_code_for(ast_node->left_operand)
swap_stack_top_2() // get stack top 2 elements in correct order
}
emit_code_for(ast_node)
}
The code emitter will work like this:
emit_code_for(ast_node) {
switch (ast_node->kind) {
case DIV: // An operation that needs specific registers
pin_register(EAX) // Might generate some code to make EAX available
pin_register(EDX) // Might generate some code to make EDX available
emit_instruction(XOR, EDX, EDX) // clear EDX
emit_instruction(MOV, EAX, stack(1)) // lhs to EAX
emit_instruction(DIV, stack(0)) // divide by rhs operand
pop(2) // remove 2 elements and free their PLACES
free_place(EDX) // EDX not needed any more.
mark_busy(EAX) // EAX now only busy, not pinned.
push(EAX) // Push result on operand stack
break;
case ADD: // An operation that needs no specific register.
PLACE result = emit_instruction(ADD, stack(1), stack(0))
pop(2)
push(result)
break;
... and so on
}
}
Finally, the instruction emitter must know what to do when its operands have combinations of types not supported by the processor instruction set. For example, it might have to load a memory PLACE into a register.
emit_instruction(op, lhs, [optional] rhs) {
switch (op) {
case DIV:
assert(EAX->state == PINNED && EDX->state == PINNED)
print_instruction(DIV, lhs)
return EAX;
case ADD:
if (lhs->kind == REGISTER) {
print_instruction(ADD, lhs, rhs)
return lhs
}
if (rhs->kind == REGISTER) {
print_instruction(ADD, rhs, lhs)
return rhs
}
// Both operands are MEMORY
R = allocate_register // Get a register; might emit some code.
print_instruction(MOV, R, lhs)
print_instruction(ADD, R, rhs)
return R
... and so on ...
}
}
I've necessarily left out many details. Ask what isn't clear.
OP's Questions Addressed
You're right that I intended stack(n) to be the PLACE that is n from the top of the operand stack.
Leaves of the syntax tree just push a PLACE for a computed value on the operand stack to satisfy the invariant.
As I said above, you can either create special kinds of PLACEs for these operands (user-labeled memory locations and/or immediate values), or - simpler and as you proposed - allocate a register and emit code that loads the value into that register, then push the register's PLACE onto the stack. The simpler method will result in unnecessary load instructions and consume more registers than needed. For example x = x + 1 will generate code something like:
mov esi, [ebp + x]
mov edi, 1
add esi, edi
mov [ebp + x], esi
Here I'm using x to denote the base pointer offset of the variable.
With PLACEs for variables and literals, you can easily get:
mov esi, [ebp + x]
add esi, 1
mov [ebp + x], esi
By making the code generator aware of the PLACE where the assignment needs to put its answer, you can get
add [ebp + x], 1
or equivalently
inc [ebp + x]
Accomplish this by adding a parameter PLACE *target to the code generator that describes where the final value of the computed expression value needs to go. If you're not currently compiling an expression, this is set to NULL. Note that with target, the code generator invariant changes: The expression result's PLACE is at the top of the operand stack unless target is set. In that case, it's already been computed into the target's PLACE.
How would this work on x = x + 1? The ASSIGNMENT case in the emit_code_for procedure would provide the target as the PLACE for x when it calls itself recursively to compile x + 1. This delegates responsibility downward for getting the computed value to the right memory location, which is x. The emit_code_for case for ADD now calls emit_code_for recursively to evaluate the operands x and 1 onto the stack. Since we have PLACEs for user variables and immediate values, these are pushed on the stack while generating no code at all. The ADD emitter must now be smart enough to know that if it sees a memory location L and a literal constant C on the stack and the target is also L, then it can emit add L, C, and it's done.
Remember that every time you make the code generator "smarter" by providing it with more information to make its decisions like this, it will become longer and more complicated because there are more cases to handle.
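To make the target idea concrete, here is a small C sketch of just the ADD case. Everything in it is an assumption made for illustration: the Place layout (a kind plus operand text), the emit_add name, the caller-supplied scratch register, and writing the assembly into a buffer instead of printing it.

```c
#include <stdio.h>
#include <string.h>

typedef enum { P_REGISTER, P_VARIABLE, P_IMMEDIATE } PlaceKind;

/* Hypothetical PLACE: text holds the operand as it appears in the
   assembly, e.g. "esi", "[ebp + x]" or "1". */
typedef struct { PlaceKind kind; const char *text; } Place;

static int same_place(const Place *a, const Place *b) {
    return a && b && a->kind == b->kind && strcmp(a->text, b->text) == 0;
}

/* Emits code for lhs + rhs into out.
   Returns 1 if the result was computed directly into *target,
   0 if the result was left in the scratch register instead. */
int emit_add(const Place *lhs, const Place *rhs, const Place *target,
             const char *scratch, char *out, size_t n) {
    /* Special case from the text: memory location L, constant C, target L. */
    if (lhs->kind == P_VARIABLE && rhs->kind == P_IMMEDIATE &&
        same_place(lhs, target)) {
        snprintf(out, n, "add dword %s, %s\n", lhs->text, rhs->text);
        return 1;
    }
    /* General case: load lhs into the scratch register, then add rhs. */
    snprintf(out, n, "mov %s, %s\nadd %s, %s\n",
             scratch, lhs->text, scratch, rhs->text);
    return 0;
}
```

With x as both left operand and target, this produces the single add instruction; with no target it falls back to the two-instruction mov/add form, and the caller is still responsible for storing the result.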


Why does the stack frame also store instructions(besides data)? What is the precise mechanism by which instructions on stack frame get executed?

Short version:
0: 48 c7 c7 ee 4f 37 45 mov $0x45374fee, %rdi
7: 68 60 18 40 00 pushq $0x401860
c: c3 retq
How can these 3 instructions (0, 7, c), saved in the stack frame, get executed? I thought the stack frame only stores data; does it also store instructions? I know data is read into registers, but how do these instructions get executed?
Long version:
I am self-studying 15-213(Computer Systems) from CMU. In the Attack lab, there is an instance (phase 2) where the stack frame gets overwritten with "attack" instructions. The attack happens by then overwriting the return address from the calling function getbuf() with the address %rsp points to, which I know is the top of the stack frame. In this case, the top of the stack frame is in turn injected with the attack code mentioned above.
Here is the question: from reading the book (CSAPP), I get the sense that the stack frame only stores data that has overflowed from the registers (including the return address, extra arguments, etc.). But I don't get why it can also store instructions (attack code) and have them executed. How exactly does the content in the stack frame, which %rsp points to, get executed? I also know that %rsp stores the return address of the calling function, the point being that it is an address, not an instruction. So by which mechanism exactly does a supposed address get executed as an instruction? I am very confused.
Edit: Here is a link to the question(4.2 level 2):
http://csapp.cs.cmu.edu/3e/attacklab.pdf
This is a post that is helpful for me in understanding: https://github.com/magna25/Attack-Lab/blob/master/Phase%202.md
Thanks for your explanation!
The ret instruction pops a pointer from the current top of the stack and jumps to it. If, while in a function, you modify that stack slot to point to another function or piece of code that could be used maliciously, execution will continue there when the function returns.
The code below doesn't necessarily compile, and it is just meant to represent the concept.
For example, we have two functions: add(), and badcode():
int add(int a, int b)
{
return a + b;
}
void badcode()
{
// Some very bad code
}
Let's also assume that we have a stack such as the below when we call add()
...
0x00....18 extra arguments
0x00....10 return address
0x00....08 saved RBP
0x00....00 local variables and etc.
...
If, during the execution of add, we managed to change the return address to the address of badcode(), then on the ret instruction we would automatically start executing badcode(). I don't know if this answers your question.
Edit:
An instruction is simply an array of numbers. Where you store them is irrelevant (mostly) to their execution. A stack is essentially an abstract data structure, it is not a special place in RAM. If your OS doesn't mark the stack as non-executable, there is nothing stopping the code on the stack from being returned to by the ret.
Edit 2:
I get the sense that the stack frame only stores data that is overflown
from the registers(including return address, extra arguments, etc.)
I do not think that you know how registers, RAM, stack, and programs are incorporated. The sense that stack frame only stores data that is overflown is incorrect.
Let's start over.
Registers are small pieces of storage inside your CPU. They are independent of RAM. On x86 there are 8 main general-purpose registers: a, c, d, b, si, di, sp, and bp. a stands for accumulator and is generally used for arithmetic operations; likewise b stands for base, c for counter, d for data, si for source index, di for destination index, sp is the stack pointer, and bp is the base pointer.
On 16 bit computers a, b, c, d, si, di, sp, and bp are 16 bits (2 byte). The a, b, c, and d are often shown as ax, bx, cx, and dx where the x stands for extension from their original 8 bit versions. They can also be referred to as eax, ecx, edx, ebx, esi, edi, esp, ebp for 32 bit (e again stands for extended) and rax, rcx, rdx, rbx, rsi, rdi, rsp, rbp for 64 bit.
Once again these are on your CPU and are independent of RAM. CPU uses these registers to do everything that it does. You wanna add two numbers? put one of them inside ax and another one inside cx and add them.
You also have RAM. RAM (standing for Random Access Memory) is a storage device that allows you to access and modify all of its values using equal computation power or time (hence the term random access). Each value that RAM holds also has an address that determines where on the RAM this value is. CPU can use numbers and treat such numbers as addresses to access memory addresses of RAM. Numbers that are used for such purposes are called pointers.
A stack is an abstract data structure. It has a FILO (first in, last out) structure, which means that to reach the first datum you stored you have to go through all of the data stored after it. To manipulate the stack the CPU provides us with sp, which holds a pointer to the current position (the top) of the stack, and bp, which conventionally holds the base of the current stack frame. The stack usually grows downwards, meaning that if we start a stack at memory address 0x100 and store 4 bytes in it, sp will now be at memory address 0x100 - 4 = 0xFC. To do such operations automatically we have the push and pop instructions. In that sense a stack can be used to store any type of data, regardless of the data's relation to registers or programs.
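As a rough illustration of the downward-growing stack, here is a tiny C model. The Machine struct and the push32/pop32 names are made up for this sketch; a real CPU does this in hardware with push and pop.

```c
#include <stdint.h>
#include <string.h>

/* Toy machine: 0x100 bytes of RAM and a stack pointer into it. */
typedef struct {
    uint8_t  ram[0x100];
    uint32_t sp;          /* set to 0x100 (one past the end) for an empty stack */
} Machine;

/* push: move sp down by 4, then store the value where sp now points. */
void push32(Machine *m, uint32_t value) {
    m->sp -= 4;
    memcpy(&m->ram[m->sp], &value, 4);
}

/* pop: load the value sp points to, then move sp back up by 4. */
uint32_t pop32(Machine *m) {
    uint32_t value;
    memcpy(&value, &m->ram[m->sp], 4);
    m->sp += 4;
    return value;
}
```

Starting with sp = 0x100, a single push32 leaves sp at 0xFC, and values come back out in the reverse order they went in.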
Programs are pieces of structured code that are placed on the RAM by the operating system. The operating system reads program headers and relevant information and sets up an environment for the program to run on. For each program a stack is set up, usually, some space for the heap is given, and instructions (which are the building blocks of a program) are placed in arbitrary memory locations that are either predetermined by the program itself or automatically given by the OS.
Over the years some conventions have been established. For example, on most CPUs the ret instruction pops a pointer-sized value from the stack and jumps to it. Jumping means continuing execution at a particular RAM address. This is only a convention and has nothing to do with data overflowing from registers. For that reason, when a function is called, the return address (the address of the instruction following the call) is first pushed onto the stack so that it can be retrieved later by ret. Local variables are also stored on the stack, along with arguments that do not fit in registers (on x86-64 System V, the first six integer arguments are passed in registers).
Does this help?
I know it is a long read but I couldn't be sure on what you know and what you don't know.
Yet Another Edit:
Let's also take a look at the code from the PDF:
void test()
{
int val;
val = getbuf();
printf("No exploit. Getbuf returned 0x%x\n", val);
}
Phase 2 involves injecting a small amount of code as part of your exploit string.
Within the file ctarget there is code for a function touch2 having the following C representation:
void touch2(unsigned val)
{
vlevel = 2; /* Part of validation protocol */
if (val == cookie) {
printf("Touch2!: You called touch2(0x%.8x)\n", val);
validate(2);
} else {
printf("Misfire: You called touch2(0x%.8x)\n", val);
fail(2);
}
exit(0);
}
Your task is to get CTARGET to execute the code for touch2 rather than returning to test. In this case,
however, you must make it appear to touch2 as if you have passed your cookie as its argument.
Let's think about what you need to do:
You need to modify the stack of test() so that two things happen. The first is that you do not return to test() but rather return to touch2. The other is that you give touch2 an argument, which is your cookie. Since you are passing only one argument, you don't need to modify the stack for the argument at all: the first argument is passed in rdi under the x86_64 calling convention.
The final code that you write has to change the return address to touch2()'s address and also execute mov rdi, cookie
Edit:
Earlier I talked about RAM being able to store data at addresses and the CPU being able to interact with them. There is a special register on your CPU that you cannot access directly from your assembly code. This register is called ip/eip/rip, which stands for instruction pointer. It holds a 16/32/64 bit pointer to an address in RAM: the address of the next instruction the CPU will execute. With that in mind, we can say that what a ret instruction does is
pop rip
which means get the last 64 bits (8 bytes for a pointer) on the stack into this instruction pointer. Once rip is set to this value, the CPU begins executing this code. The CPU doesn't do any checks on rip whatsoever. You can technically do the following thing (excuse me, my assembly is in intel syntax):
mov rax, str ; move the RAM address of "str" into rax
push rax ; push rax into stack
ret ; return to the last pushed qword (8 bytes) on the stack
str: db "Hello, world!", 0 ; define a string
This code can call/execute a string. Your CPU will be very upset, though, because those bytes are unlikely to form valid instructions, and the program will probably crash.
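The point that the CPU does not care where the bytes it executes are stored can be illustrated with a toy machine in C. This is purely a made-up model (the opcodes, the Cpu struct and run are all hypothetical), not x86: code, data and stack share one 256-byte memory, and OP_RET simply pops a byte into ip.

```c
#include <stdint.h>

enum { OP_HALT = 0, OP_PUSH = 1, OP_RET = 2, OP_INC = 3 };

typedef struct {
    uint8_t mem[256];  /* code, data and stack all live here */
    uint8_t ip;        /* instruction pointer */
    uint8_t sp;        /* stack pointer; 0 means empty (wraps to 255 on push) */
    int counter;       /* visible side effect of OP_INC */
} Cpu;

/* Runs until OP_HALT (or a step limit) and returns the counter, -1 on timeout. */
int run(Cpu *c) {
    for (int steps = 0; steps < 1000; steps++) {
        uint8_t op = c->mem[c->ip++];
        switch (op) {
        case OP_PUSH:                       /* push the next byte as an immediate */
            c->mem[--c->sp] = c->mem[c->ip++];
            break;
        case OP_RET:                        /* "pop ip": no checks at all */
            c->ip = c->mem[c->sp++];
            break;
        case OP_INC:
            c->counter++;
            break;
        default:                            /* OP_HALT or unknown opcode */
            return c->counter;
        }
    }
    return -1;
}
```

If the program at address 0 is PUSH 200; RET and someone has written INC, INC, HALT into memory at address 200 (which could just as well be part of the stack area), the machine happily runs those bytes and the counter ends up at 2. Real systems add protections such as a non-executable stack on top of this basic mechanism.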

ARM Cortex M3: STMDB instruction - what exactly is decremented, and when?

I am writing an RTOS and there is something I don't understand. My context switch, written in assembly, has the line:
STMDB r0!,{r4-r11}
Where r0 is being used to store the current process stack pointer (PSP). Since this is in a handler and running in handler mode, the MSP is being used for the function, so I can't just push.
For the sake of argument let us say that r0 stores the address 0x64 (I am aware this is not reasonable, but it is not relevant to the discussion below).
Do I understand this correctly: the first register to be stored, r4, will be placed at 0x60, since the decrement before part means that r0 is first decremented by one 32-bit word, then the storage takes place?
TL;DR: 'DB' stands for 'decrement before'.
[stm|ldm][modifier] Rn!, {reg_list}
Rn! is the 'address register'
There are two mutually exclusive options for auto-index memory via the modifier.
Letter  Note
I       Increment the address register (i.e., 3 registers -> 12 bytes increment)
D       Decrement the address register
B       Before store/load operation
A       After store/load operation
You can have variants of a full/empty decrementing/incrementing stack. Ie, stack grows down/up and stack is empty/full. Decrement before would mean the stack is at a 'full' element and you grow down.
Of course, the same operations can be used for buffers. If you have a ring buffer, it can typically point to an empty or full element. This is a design choice. You would use the 'before' or 'after' versions and for ring buffers, we usually increment memory.
LDM and STM can come in all these four flavors.
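STMIA - increment after (empty ascending).
STMIB - increment before (full ascending).
STMDA - decrement after (empty descending).
STMDB - decrement before (full descending).
The LDM forms take the same four suffixes; used as stack pops they pair with the opposite store, e.g. LDMIA pops what STMDB pushed.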
The modifiers matter most with writeback, i.e., ldmxx Rn!, {reglist} or stmxx Rn!, {reglist}; without the ! the transfer addresses are the same but the address register is left unchanged. The single word versions have a different syntax.
See: ARM increment register, University of Regina lecture
Probably a good keyword for searches is 'full descending stack'. Some assemblers will offer alternatives like,
stmfd - store multiple full descending; alias stmdb.
stmed - store multiple empty descending; alias stmda.
I would just stick to the 'i','d', 'b' and 'a' permutations.
What exactly is decremented and when?
It is always the leading address register that is modified. The B/A suffix determines whether the address is adjusted before or after each word is transferred. Hopefully the above describes data structures where this is useful.
It is always a single word that is empty/full and not the whole register list. The register list is always accessed in numerical order, i.e., the lowest-numbered register goes to the lowest address and the highest-numbered register to the highest address, regardless of the order you write them in. Each register is just an include/exclude bit in the opcode.
This is pretty clear from the ARM docs.
With respect to DB (decrement before):
start_address = Rn - (Number_Of_Set_Bits_In(register_list) * 4)
end_address = Rn - 4
if ConditionPassed(cond) and W == 1 then
Rn = Rn - (Number_Of_Set_Bits_In(register_list) * 4)
STM in general
if ConditionPassed(cond) then
address = start_address
for i = 0 to 15
if register_list[i] == 1
Memory[address,4] = Ri
address = address + 4
assert end_address == address - 4
What part do you not understand?
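Applied to the numbers in the question, the pseudocode above can be followed with a small C simulation. This is only a sketch of the quoted pseudocode, not of real hardware: the stmdb function name and the word-array model of memory are assumptions, and the address is assumed to be word-aligned and inside the array.

```c
#include <stdint.h>

/* Simulates STMDB Rn!, {reglist} following the pseudocode above.
   reglist has bit i set if register Ri is in the list.
   mem_words models RAM as 32-bit words indexed by (byte address / 4).
   Returns the written-back value of Rn. */
uint32_t stmdb(uint32_t rn, uint16_t reglist,
               const uint32_t regs[16], uint32_t *mem_words) {
    uint32_t count = 0;
    for (int i = 0; i < 16; i++)
        count += (reglist >> i) & 1u;

    uint32_t start_address = rn - 4u * count;   /* decrement by the whole list */
    uint32_t address = start_address;

    for (int i = 0; i < 16; i++) {              /* lowest register, lowest address */
        if ((reglist >> i) & 1u) {
            mem_words[address / 4] = regs[i];
            address += 4;
        }
    }
    return start_address;                       /* writeback (the "!" suffix) */
}
```

With r0 = 0x64 and the list r4-r11 (8 registers), start_address is 0x64 - 32 = 0x44. So in this model r4 lands at 0x44 and r11 at 0x60, and r0 is written back as 0x44, rather than r4 going to 0x60 as the question supposes.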

How to corrupt the stack in a C program

I have to change the designated section of function_b so that it changes the stack in such a way that the program prints:
Executing function_a
Executing function_b
Finished!
At this point it also prints Executed function_b in between Executing function_b and Finished!.
I have the following code and I have to fill something in, in the part where it says // ... insert code here
#include <stdio.h>
void function_b(void){
char buffer[4];
// ... insert code here
fprintf(stdout, "Executing function_b\n");
}
void function_a(void) {
int beacon = 0x0b1c2d3;
fprintf(stdout, "Executing function_a\n");
function_b();
fprintf(stdout, "Executed function_b\n");
}
int main(void) {
function_a();
fprintf(stdout, "Finished!\n");
return 0;
}
I am using Ubuntu Linux with the gcc compiler. I compile the program with the following options: -g -fno-stack-protector -fno-omit-frame-pointer. I am using an intel processor.
Here is a solution, not exactly stable across environments, but works for me on x86_64 processor on Windows/MinGW64.
It may not work for you out of the box, but still, you might want to use a similar approach.
void function_b(void) {
char buffer[4];
buffer[0] = 0xa1; // part 1
buffer[1] = 0xb2;
buffer[2] = 0xc3;
buffer[3] = 0x04;
register int * rsp asm ("rsp"); // part 2
register size_t r10 asm ("r10");
r10 = 0;
while (*rsp != 0x04c3b2a1) {rsp++; r10++;} // part 3
while (*rsp != 0x00b1c2d3) rsp++; // part 4
rsp -= r10; // part 5
rsp = (int *) ((size_t) rsp & ~0xF); // part 6
fprintf(stdout, "Executing function_b\n");
}
The trick is that function_a and function_b each have only one local variable, and we can find the address of that variable just by searching around in memory.
First, we put a signature in the buffer, let it be the 4-byte integer 0x04c3b2a1 (remember that x86_64 is little-endian).
After that, we declare two variables to represent the registers: rsp is the stack pointer, and r10 is just some unused register.
This lets us avoid asm statements later in the code, while still being able to use the registers directly.
It is important that the variables don't actually take stack memory, they are references to processor registers themselves.
After that, we move the stack pointer in 4-byte increments (since the size of int is 4 bytes) until we get to the buffer. We have to remember the offset from the stack pointer to the first variable here, and we use r10 to store it.
Next, we want to know how far apart the stack frames of function_b and function_a are. A good approximation is the distance between buffer and beacon, so we now search for beacon.
After that, we have to step back from beacon, the first variable of function_a, to the start of function_a's whole frame on the stack.
That we do by subtracting the value stored in r10.
Finally, here comes a weirder bit.
At least on my configuration, the stack happens to be 16-byte aligned, and while the buffer array is aligned to the left of a 16-byte block, the beacon variable is aligned to the right of such block.
Or is it something with a similar effect and different explanation?..
Anyway, so we just clear the last four bits of the stack pointer to make it 16-byte aligned again.
The 32-bit GCC doesn't align anything for me, so you might want to skip or alter this line.
When working on a solution, I found the following macro useful:
#ifdef DEBUG
#define show_sp() \
do { \
register void * rsp asm ("rsp"); \
fprintf(stdout, "stack pointer is %p\n", rsp); \
} while (0)
#else
#define show_sp() do{}while(0)
#endif
After this, when you insert a show_sp(); in your code and compile with -DDEBUG, it prints the value of the stack pointer at that moment.
When compiling without -DDEBUG, the macro just compiles to an empty statement.
Of course, other variables and registers can be printed in a similar way.
OK, let's assume that the epilogue (i.e. the code at the closing } line) of function_a and of function_b is the same.
Although functions A and B are not symmetric, we can assume this because they have the same signature (no parameters, no return value), the same calling convention and the same size of local variables (4 bytes: int beacon = 0x0b1c2d3 vs. char buffer[4];), and with optimization both locals will be dropped because they are unused. But we must not use additional local variables in function_b, so as not to break this assumption. The most problematic point here is what happens if function_a or function_b uses nonvolatile registers (and as a result saves them in the prologue and restores them in the epilogue), but there seems to be no reason for that here.
So my code below is based on this assumption: epilogueA == epilogueB (the solution of @Gassa is really based on it too).
It also needs to be stated very clearly that function_a and function_b must not be inlined. This is very important: without it no solution is possible. So I allow myself to add a noinline attribute to function_a and function_b. Note that this is not a code change but an added attribute, which the author of the task implicitly assumes but does not clearly state. I don't know how to mark a function as noinline in GCC, but in CL __declspec(noinline) is used for this.
The code below is written for the CL compiler, which has the following intrinsic function:
void * _AddressOfReturnAddress();
but I think GCC must also have an analog of this function. I also use
void* _ReturnAddress();
although really _ReturnAddress() == *(void**)_AddressOfReturnAddress(), so we could use _AddressOfReturnAddress() only. Using _ReturnAddress() simply makes the source code (but not the binary, which is identical) smaller and more readable.
The code below works for both x86 and x64, and it works (tested) with any optimization level.
Although I use 2 global variables, the code is thread safe: we can call main from multiple threads concurrently, or call it multiple times, and everything will work correctly (provided, of course, as I said at the beginning, that epilogueA == epilogueB).
I hope the comments in the code are self-explanatory enough.
__declspec(noinline) void function_b(void){
char buffer[4];
buffer[0] = 0;
static void *IPa, *IPb;
// save the IPa address
_InterlockedCompareExchangePointer(&IPa, _ReturnAddress(), 0);
if (_ReturnAddress() == IPa)
{
// we called from function_a
function_b();
// <-- IPb
if (_ReturnAddress() == IPa)
{
// we called from function_a, change return address for return to IPb instead IPa
*(void**)_AddressOfReturnAddress() = IPb;
return;
}
// we at stack of function_a here.
// we must be really at point IPa
// and execute fprintf(stdout, "Executed function_b\n"); + '}' (epilogueA)
// but we will execute fprintf(stdout, "Executing function_b\n"); + '}' (epilogueB)
// assume that epilogueA == epilogueB
}
else
{
// we called from function_b
IPb = _ReturnAddress();
return;
}
fprintf(stdout, "Executing function_b\n");
// epilogueB
}
__declspec(noinline) void function_a(void) {
int beacon = 0x0b1c2d3;
fprintf(stdout, "Executing function_a\n");
function_b();
// <-- IPa
fprintf(stdout, "Executed function_b\n");
// epilogueA
}
int main(void) {
function_a();
fprintf(stdout, "Finished!\n");
return 0;
}
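Since the question uses GCC rather than CL, here is a hedged sketch of what seem to be the closest GCC counterparts: __attribute__((noinline)) in place of __declspec(noinline), and __builtin_return_address(0) in place of _ReturnAddress(). The function name where_do_i_return is made up for the example. As far as I know GCC has no direct equivalent of _AddressOfReturnAddress(); on x86 with frame pointers enabled the slot can usually be derived from __builtin_frame_address(0), but treat that as an assumption to verify on your own build.

```c
/* GCC/Clang sketch: obtain the address this function will return to. */
__attribute__((noinline))
void *where_do_i_return(void) {
    /* Level 0 means "the return address of the current function". */
    return __builtin_return_address(0);
}
```

Calling where_do_i_return() from function_a would yield an address inside function_a, just after the call instruction, which plays the role of IPa in the code above.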

possible to do if (!boolvar) { ... in 1 asm instruction?

This question is more out of curiosity than necessity:
Is it possible to rewrite the c code if ( !boolvar ) { ... in a way so it is compiled to 1 cpu instruction?
I've tried thinking about this on a theoretical level and this is what I've come up with:
if ( !boolvar ) { ...
would need to first negate the variable and then branch depending on that -> 2 instructions (negate + branch)
if ( boolvar == false ) { ...
would need to load the value of false into a register and then branch depending on that -> 2 instructions (load + branch)
if ( boolvar != true ) { ...
would need to load the value of true into a register and then branch ("branch-if-not-equal") depending on that -> 2 instructions (load + "branch-if-not-equal")
Am I wrong with my assumptions? Is there something I'm overlooking?
I know I can produce intermediate asm versions of programs, but I wouldn't know how to use this in a way so I can on one hand turn on compiler optimization and at the same time not have an empty if statement optimized away (or have the if statement optimized together with its content, giving some non-generic answer)
P.S.: Of course I also searched google and SO for this, but with such short search terms I couldn't really find anything useful
P.P.S.: I'd be fine with a semantically equivalent version which is not syntactical equivalent, e.g. not using if.
Edit: feel free to correct me if my assumptions about the emitted asm instructions are wrong.
Edit2: I've actually learned asm about 15yrs ago, and relearned it about 5yrs ago for the alpha architecture, but I hope my question is still clear enough to figure out what I'm asking. Also, you're free to assume any kind of processor extension common in consumer cpus up to AVX2 (current haswell cpu as of the time of writing this) if it helps in finding a good answer.
At the end of my post I explain why you should not aim for this behaviour (on x86).
As Jerry Coffin has written, most jumps in x86 depend on the flags register.
There is one exception though: The j*cxz set of instructions which jump if the ecx/rcx register is zero. To achieve this you need to make sure that your boolvar uses the ecx register. You can achieve that by specifically assigning it to that register
register int boolvar asm ("ecx");
But far from all compilers use the j*cxz set of instructions. There is a flag for icc to make it do that, but it is generally not advisable. The Intel manual states that the two instructions
test ecx, ecx
jz ...
are faster on the processor.
The reason for this is that x86 is a CISC (complex) instruction set. In the actual hardware, though, the processor splits complex instructions that appear as one instruction in the asm into multiple microinstructions, which are then executed in a RISC style. This is why not all instructions require the same execution time, and why several small ones are sometimes faster than one big one.
test and jz are single microinstructions, but jecxz will be decomposed into those two anyway.
The only reason the j*cxz set of instructions exists is to let you make a conditional jump without modifying the flags register.
Yes, it's possible -- but doing so will depend on the context in which this code takes place.
Conditional branches in an x86 depend upon the values in the flags register. For this to compile down to a single instruction, some other code will already need to set the correct flag, so all that's left is a single instruction like jnz wherever.
For example:
boolvar = x == y;
if (!boolvar) {
do_something();
}
...could end up rendered as something like:
mov eax, x
cmp eax, y ; `boolvar = x == y;`
jz @f
call do_something
@@:
Depending on your viewpoint, it could even compile down to only part of an instruction. For example, quite a few instructions can be "predicated", so they're executed only if some previously defined condition is true. In this case, you might have one instruction for setting "boolvar" to the correct value, followed by one to conditionally call a function, so there's no one (complete) instruction that corresponds to the if statement itself.
Although you're unlikely to see it in decently written C, a single assembly language instruction could include even more than that. For an obvious example, consider something like:
x = 10;
looptop:
-- x;
boolvar = x == 0;
if (!boolvar)
goto looptop;
This entire sequence could be compiled down to something like:
mov ecx, 10
looptop:
loop looptop
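As a rough cross-check of the loop above, here is a small C sketch that counts how many times the decrement runs. The function name count_down is made up for illustration, and the sketch only models the decrement-then-test behaviour; it makes no claim about what any particular compiler emits.

```c
/* Hypothetical helper: runs the countdown from the example and returns
   how many times the decrement executed. Assumes start > 0; with start == 0
   the counter wraps around, just as the loop instruction would with ecx == 0. */
unsigned count_down(unsigned start)
{
    unsigned x = start;
    unsigned trips = 0;
    int boolvar;

    do {
        --x;                    /* -- x;              */
        ++trips;
        boolvar = (x == 0);     /* boolvar = x == 0;  */
    } while (!boolvar);         /* if (!boolvar) goto looptop; */

    return trips;
}
```

Starting from 10, the decrement runs ten times before x reaches zero.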
Am I wrong with my assumptions
You are wrong about several of them. First, you should know that one instruction is not necessarily faster than several. For example, on newer microarchitectures test can macro-fuse with jcc, so the two instructions run as one. Conversely, a division is so slow that tens or hundreds of simpler instructions may have finished in the same time. Compiling the if block to a single instruction isn't worth it if that instruction is slower than multiple instructions.
Besides, if ( !boolvar ) { ... doesn't need to first negate the variable and then branch depending on that. Most jumps in x86 are based on flags, and they exist for both the "yes" and the "no" condition, so there is no need to negate the value. We can simply jump on non-zero instead of jumping on zero.
Similarly, if ( boolvar == false ) { ... doesn't need to load the value of false into a register and then branch depending on that. false is a constant equal to 0, which can be embedded as an immediate in the instruction (like cmp reg, 0). For checking against zero, though, a simple test reg, reg is enough. Then jnz or jz is used to jump on non-zero/zero, and it is fused with the preceding test instruction into one.
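One way to check this yourself is sketched below: the three spellings from the question, each wrapped in its own function so that the condition depends on a parameter and the if cannot be optimized away. The function names are made up for illustration. Compiling the file with something like gcc -O2 -S and comparing the three function bodies should show whether your compiler treats them identically; that is an expectation, not a guarantee.

```c
#include <stdbool.h>

/* Hypothetical helpers: each returns 1 when boolvar is false, 0 otherwise. */

int via_not(bool boolvar)
{
    if (!boolvar)
        return 1;
    return 0;
}

int via_eq_false(bool boolvar)
{
    if (boolvar == false)
        return 1;
    return 0;
}

int via_ne_true(bool boolvar)
{
    if (boolvar != true)
        return 1;
    return 0;
}
```

The three functions are semantically equivalent for a bool argument, so any difference in the generated assembly would come from the compiler, not from the source.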
It's possible to make an if header or body that compiles to a single instruction, but it depends entirely on what you need to do and what condition is used. The flag for boolvar may already be available from the previous statement, so the if in the next line can use it to jump directly, as you can see in Jerry Coffin's answer.
Moreover, x86 has conditional moves, so if the body of the if is a simple assignment, it may be done in one instruction. Below is an example and its output:
int f(bool condition, int x, int y)
{
int ret = x;
if (!condition)
ret = y;
return ret;
}
f(bool, int, int):
test dil, dil ; if(!condition)
mov eax, edx ; ret = y
cmovne eax, esi ; if(condition) ret = x
ret
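For comparison, here is a sketch of the same selection written without an if in the source. This is a different technique from the conditional move above (a ternary, and a bit-mask select), shown only to illustrate that the result does not require a branch; the function names are hypothetical, and the mask version assumes a two's-complement int.

```c
#include <stdbool.h>

/* Same result as f() above: x when condition is true, y otherwise. */
int select_ternary(bool condition, int x, int y)
{
    return condition ? x : y;
}

/* Bit-mask variant; assumes two's-complement int representation. */
int select_mask(bool condition, int x, int y)
{
    unsigned mask = 0u - (unsigned)condition;   /* all ones if true, 0 if false */
    return (int)(((unsigned)x & mask) | ((unsigned)y & ~mask));
}
```

Whether a compiler turns either form into cmov, a branch, or something else depends on the compiler and its options.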
In some other cases you don't even need a conditional move or a jump. For example
bool f(bool condition)
{
bool ret = false;
if (!condition)
ret = true;
return ret;
}
compiles to a single xor without any jump at all:
f(bool):
mov eax, edi
xor eax, 1
ret
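The equivalence the compiler exploits here can be written out in C. The sketch below uses made-up function names and simply places the if version next to an explicit xor with 1, which gives the same result for both possible bool values.

```c
#include <stdbool.h>

/* The if-based version from the example above. */
bool negate_with_if(bool condition)
{
    bool ret = false;
    if (!condition)
        ret = true;
    return ret;
}

/* What the generated code effectively computes: flip the low bit. */
bool negate_with_xor(bool condition)
{
    return condition ^ 1;
}
```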
On ARM (ARMv7 and earlier, in ARM mode) almost every instruction can be executed conditionally, so an if may translate to only one instruction.
For example, the following loop
while (i != j)
{
if (i > j)
{
i -= j;
}
else
{
j -= i;
}
}
can be translated to ARM assembly as
loop: CMP Ri, Rj ; set condition "NE" if (i != j),
; "GT" if (i > j),
; or "LT" if (i < j)
SUBGT Ri, Ri, Rj ; if "GT" (Greater Than), i = i-j;
SUBLT Rj, Rj, Ri ; if "LT" (Less Than), j = j-i;
BNE loop ; if "NE" (Not Equal), then loop
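For reference, the loop above is the subtraction form of Euclid's algorithm. A complete C version is sketched below so it can be compiled and its output compared with the ARM listing; the function name is made up, and the sketch assumes both inputs are greater than zero (with a zero input the loop never terminates).

```c
/* Subtraction-based GCD; assumes i > 0 and j > 0. */
unsigned gcd_by_subtraction(unsigned i, unsigned j)
{
    while (i != j) {
        if (i > j)
            i -= j;     /* corresponds to SUBGT */
        else
            j -= i;     /* corresponds to SUBLT */
    }
    return i;           /* i == j here */
}
```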

Buffer overflow in C

I'm attempting to write a simple buffer overflow using C on Mac OS X 10.6 64-bit. Here's the concept:
void function() {
char buffer[64];
buffer[offset] += 7; // i'm not sure how large offset needs to be, or if
// 7 is correct.
}
int main() {
int x = 0;
function();
x += 1;
printf("%d\n", x); // the idea is to modify the return address so that
// the x += 1 expression is not executed and 0 gets
// printed
return 0;
}
Here's part of main's assembler dump:
...
0x0000000100000ebe <main+30>: callq 0x100000e30 <function>
0x0000000100000ec3 <main+35>: movl $0x1,-0x8(%rbp)
0x0000000100000eca <main+42>: mov -0x8(%rbp),%esi
0x0000000100000ecd <main+45>: xor %al,%al
0x0000000100000ecf <main+47>: lea 0x56(%rip),%rdi # 0x100000f2c
0x0000000100000ed6 <main+54>: callq 0x100000ef4 <dyld_stub_printf>
...
I want to jump over the movl instruction, which would mean I'd need to increment the return address by 42 - 35 = 7 (correct?). Now I need to know where the return address is stored so I can calculate the correct offset.
I have tried searching for the correct value manually, but either 1 gets printed or I get abort trap – is there maybe some kind of buffer overflow protection going on?
Using an offset of 88 works on my machine. I used Nemo's approach of finding out the return address.
This 32-bit example illustrates how you can figure it out, see below for 64-bit:
#include <stdio.h>
void function() {
char buffer[64];
char *p;
asm("lea 4(%%ebp),%0" : "=r" (p)); // loads address of return address
printf("%d\n", (int)(p - buffer)); // computes offset (cast: p - buffer is a ptrdiff_t)
buffer[p - buffer] += 9; // 9 from disassembling main
}
int main() {
volatile int x = 7;
function();
x++;
printf("x = %d\n", x); // prints 7, not 8
}
On my system the offset is 76. That's the 64 bytes of the buffer (remember, the stack grows down, so the start of the buffer is far from the return address) plus whatever other detritus is in between.
Obviously if you are attacking an existing program you can't expect it to compute the answer for you, but I think this illustrates the principle.
(Also, we are lucky that +9 does not carry out into another byte. Otherwise the single byte increment would not set the return address how we expected. This example may break if you get unlucky with the return address within main)
I overlooked the 64-bitness of the original question somehow. The equivalent for x86-64 is 8(%rbp) because pointers are 8 bytes long. In that case my test build happens to produce an offset of 104. In the code above substitute 8(%%rbp) using the double %% to get a single % in the output assembly. This is described in this ABI document. Search for 8(%rbp).
There is a complaint in the comments that 4(%ebp) is just as magic as 76 or any other arbitrary number. In fact the meaning of the register %ebp (also called the "frame pointer") and its relationship to the location of the return address on the stack is standardized. One illustration I quickly Googled is here. That article uses the terminology "base pointer". If you wanted to exploit buffer overflows on other architectures it would require similarly detailed knowledge of the calling conventions of that CPU.
Roddy is right that you need to operate on pointer-sized values.
I would start by reading values in your exploit function (and printing them) rather than writing them. As you crawl past the end of your array, you should start to see values from the stack. Before long you should find the return address and be able to line it up with your disassembler dump.
Disassemble function() and see what it looks like.
Offset needs to be positive, maybe 64+8, as it's a 64-bit address. Also, you should do the '+7' on a pointer-sized object, not on a char. Otherwise, if the two addresses straddle a 256-byte boundary, you will have exploited your exploit...
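The difference between bumping a single byte and bumping the whole pointer can be modelled with plain integer arithmetic. The sketch below does not touch the stack at all; the function names are hypothetical, 0x100000ec3 is the return address from the question's dump, and 0x100000efd is a made-up address chosen so that the low byte overflows. It assumes a little-endian machine, where the first byte of the stored address is its low byte.

```c
#include <stdint.h>

/* Models buffer[offset] += delta: only the low byte of the address changes,
   and the addition wraps at 256 without carrying into the next byte. */
uint64_t bump_low_byte(uint64_t addr, uint8_t delta)
{
    uint64_t low = (addr + delta) & 0xFFu;
    return (addr & ~(uint64_t)0xFF) | low;
}

/* Models adding delta to the whole pointer-sized value. */
uint64_t bump_pointer(uint64_t addr, uint8_t delta)
{
    return addr + delta;
}
```

For 0x100000ec3 both functions give 0x100000eca. For 0x100000efd the byte version yields 0x100000e04 while the pointer version yields 0x100000f04, which is the failure case described above.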
You might try running your code in a debugger, stepping each assembly line at a time, and examining the stack's memory space as well as registers.
I always like to operate on nice data types, like this one:
struct stackframe {
char *sf_bp;
char *sf_return_address;
};
void function() {
/* the following code is dirty. */
char *dummy;
dummy = (char *)&dummy;
struct stackframe *stackframe = (struct stackframe *)(dummy + 24); /* try multiples of 4 here. */
/* here starts the beautiful code. */
stackframe->sf_return_address += 7;
}
Using this code, you can easily check with the debugger whether the value in stackframe->sf_return_address matches your expectations.

Resources