Understanding C vs its Assembly counterpart

Say we are given a function:
int exchange(int *xp, int y)
{
int x = *xp;
*xp = y;
return x;
}
So, the book I am reading explains that xp and y are stored at offsets 8 and 12 relative to the address in register %ebp. What I don't understand is why they are stored at offsets of 8 and 12 in particular. Furthermore, what is an offset in this context? Finally, how do 8 and 12 fit when the register accepts moves in units of 1, 2, and 4 bytes?
The assembly code:
xp at %ebp+8, y at %ebp+12
movl 8(%ebp), %edx (Get xp)
(By copying to %eax below, x becomes the return value)
movl (%edx), %eax (Get x at xp)
movl 12(%ebp), %ecx (Get y)
movl %ecx, (%edx) (Store y at xp)
What I think the answer is:
So, when examining registers, it was common to see something like register %rdi holding the value 0x1004, which is an address, and the memory at address 0x1004 holding the value 0xAA.
Of course, this is a hypothetical example that doesn't line up with the registers listed in the book. Each register is 16-32 bit and the top four can be used to store integers freely. Does offsetting it by 8 make it akin to 0x1000 + 8? Again, I'm not entirely sure what the offset in this scenario is for when we are storing new values into empty space.

Because of how the call stack is structured under the C calling convention (cdecl).
First the caller pushes the 4-byte y, then the 4-byte xp (arguments are pushed right to left; this order matters so that C can support variadic functions). Then the call to your function implicitly pushes the return address, which is also 4 bytes (this is a 32-bit program).
The first thing your function does is push the current value of ebp, which it must restore later so that the caller can continue working properly, and then copy the current value of esp (the stack pointer) into ebp. In sum:
push %ebp
movl %esp, %ebp
This is also known as the function prologue.
When all this is done you are finally ready to run the code you wrote. At this stage the stack looks something like this:
%ebp- ? = addresses of your local variables (which in this example you don't have)
%ebp+ 0 = address of the saved value of the previous ebp
%ebp+ 4 = address of the return address
%ebp+ 8 = address where the value of xp is stored
%ebp+12 = address where the value of y is stored
%ebp+16 = out of bounds; this memory belongs to the caller
When your function is done it will wrap it up by setting esp back to ebp, then pop the original ebp and ret.
movl %ebp, %esp
pop %ebp
ret
ret is basically a shortcut to pop a pointer from the stack and jmp to it.
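To see why the offsets come out as 8 and 12, the pushes described above can be replayed on a toy stack. This is only a minimal sketch, not real machine state: mem, push32, load32 and frame_slot are hypothetical names, the 64-byte memory is arbitrary, and the return address and the caller's ebp value are made up.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical toy machine: 64 bytes of "stack" memory and two registers. */
static uint8_t  mem[64];
static uint32_t esp = 64, ebp = 0;

/* push: move the stack pointer down 4 bytes, then store the value there. */
static void push32(uint32_t v) { esp -= 4; memcpy(&mem[esp], &v, 4); }

/* Read the 4-byte value stored at a given toy address. */
static uint32_t load32(uint32_t addr) { uint32_t v; memcpy(&v, &mem[addr], 4); return v; }

/* Replays a cdecl call exchange(xp, y) plus the callee prologue, then
   returns whatever sits at ebp + offset, like "movl offset(%ebp), reg". */
uint32_t frame_slot(uint32_t xp, uint32_t y, uint32_t offset) {
    esp = 64;
    ebp = 0x11111111;       /* caller's frame pointer (made-up value) */
    push32(y);              /* caller pushes arguments right to left */
    push32(xp);
    push32(0x08048123);     /* call pushes the return address (made-up value) */
    push32(ebp);            /* prologue: push %ebp */
    ebp = esp;              /* prologue: movl %esp, %ebp */
    return load32(ebp + offset);
}
```

Under these assumptions, frame_slot(0x1000, 42, 8) yields 0x1000 (xp) and frame_slot(0x1000, 42, 12) yields 42 (y), which is what movl 8(%ebp) and movl 12(%ebp) read. Offsets 0 and 4 hold the saved ebp and the return address.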
Edit: Fixed order of parameters for AT&T assembly

Look at the normal function entry in assembler:
push ebp
mov ebp, esp
sub esp, <size of local variables>
So ebp+0 holds the previous value of ebp. Just above it, at ebp+4, is the return address. Above that are the parameters of the function, pushed in reverse order, so the first parameter is at ebp+8 and the second at ebp+12.

Related

Given Assembly, translate to C

I am originally given the function prototype:
void decode1(int *xp, int *yp, int *zp)
now i am told to convert the following assembly into C code:
movl 8(%ebp), %edi //line 1 ;; gets xp
movl 12(%ebp), %edx //line 2 ;; gets yp
movl 16(%ebp),%ecx //line 3 ;; gets zp
movl (%edx), %ebx //line 4 ;; gets y
movl (%ecx), %esi //line 5 ;; gets z
movl (%edi), %eax //line 6 ;; gets x
movl %eax, (%edx) //line 7 ;; stores x into yp
movl %ebx, (%ecx) //line 8 ;; stores y into zp
movl %esi, (%edi) //line 9 ;; stores z into xp
These comments were not given to me in the problem; they are what I believe the instructions are doing, but I am not 100% sure.
My question is, for lines 4-6, am I able to assume that the instructions
movl (%edx), %ebx
movl (%ecx), %esi
movl (%edi), %eax
just create local variables for y, z, x?
Also, do the registers that each variable gets stored in (i.e. edi, edx, ecx) matter, or can I use any register in any order to take the pointers off of the stack?
C code:
int tx = *xp;
int ty = *yp;
int tz = *zp;
*yp = tx;
*zp = ty;
*xp = tz;
If I wasn't given the function prototype how would I tell what type of return type is used?
Let's focus on a simpler set of instructions.
First:
movl 8(%ebp), %edi
will load into the EDI register the 4 bytes located in memory 8 bytes beyond the address held in the EBP register. This special use of EBP is a convention followed by the compiler's code generator, which for each function saves the stack pointer ESP into the EBP register and then creates a stack frame for the function's local variables.
Now the EDI register holds the first parameter passed to the function, which is a pointer to an integer, so EDI contains the address of that integer, not the integer itself.
movl (%edi), %eax
will get the 4 bytes pointed to by the EDI register and load them into the EAX register.
Now EAX holds the value of the integer that xp, the first parameter, points to.
And then:
movl %eax, (%edx)
will save this integer value into the memory pointed to by the EDX register, which was in turn loaded from EBP+12, the second parameter passed to the function.
So, your first question, is this assembly code equivalent to this?
int tx = *xp;
int ty = *yp;
int tz = *zp;
*yp = tx;
*zp = ty;
*xp = tz;
The answer is yes, but note that no tx, ty, tz local variables are created; the values just live in processor registers.
As for your second question, the answer is no: you can't tell the return type. It is, again, a convention on register usage that you can't infer just by looking at the generated assembly code.
Congratulations, you got everything right :)
You can use any register but some need to be preserved, that is they should be saved before use and restored afterwards. In typical calling conventions you can use eax, ecx and edx, the rest need to be preserved. The assembly you showed doesn't include code to do this, but presumably it is there.
As for the return type, that's hard to deduce. Simple types are returned in the eax register, and something is always in there. We can't tell if that's intended as a return value, or just remains of a local variable. That is, if your function had return tx; it could be the same assembly code. Also, we don't know the type for eax either, it could be anything that fits in there and is expected to be returned there according to the calling convention.
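As an illustration of both answers, here is a sketch of decode1 in which each local is named after the register that holds it in the listing, so every C statement lines up with one instruction. The second function, decode1_ret, is a hypothetical variant (not from the question) that returns whatever is left in eax; under the usual convention its body could compile to the same nine instructions, which is why the return type cannot be recovered from the assembly alone.

```c
/* Each local is named after the register that holds it in the listing. */
void decode1(int *xp, int *yp, int *zp) {
    int ebx = *yp;   /* line 4: movl (%edx), %ebx */
    int esi = *zp;   /* line 5: movl (%ecx), %esi */
    int eax = *xp;   /* line 6: movl (%edi), %eax */
    *yp = eax;       /* line 7: store x into *yp */
    *zp = ebx;       /* line 8: store y into *zp */
    *xp = esi;       /* line 9: store z into *xp */
}

/* Hypothetical variant: identical body, but the value left in eax
   (the old *xp) is treated as the return value. */
int decode1_ret(int *xp, int *yp, int *zp) {
    int ebx = *yp;
    int esi = *zp;
    int eax = *xp;
    *yp = eax;
    *zp = ebx;
    *xp = esi;
    return eax;
}
```

Starting from x = 1, y = 2, z = 3, a call decode1(&x, &y, &z) leaves x = 3, y = 1, z = 2.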

pushing and changing of %esp frame pointer

I have a small program, written in C, echo():
/* Read input line and write it back */
void echo() {
char buf[8]; /* Way too small! */
gets(buf);
puts(buf);
}
The corresponding assembly code:
1 echo:
2 pushl %ebp //Save %ebp on stack
3 movl %esp, %ebp
4 pushl %ebx //Save %ebx
5 subl $20, %esp //Allocate 20 bytes on stack
6 leal -12(%ebp), %ebx //Compute buf as %ebp-12
7 movl %ebx, (%esp) //Store buf at top of stack
8 call gets //Call gets
9 movl %ebx, (%esp) //Store buf at top of stack
10 call puts //Call puts
11 addl $20, %esp //Deallocate stack space
12 popl %ebx //Restore %ebx
13 popl %ebp //Restore %ebp
14 ret //Return
I have a few questions.
Why does the %esp allocate 20 bytes? The buf is 8 bytes, why the extra 12?
The return address is right above where we pushed %ebp right? (Assuming we draw the stack upside down, where it grows downward) What is the purpose of the old %ebp (which the current %ebp is pointing at, as a result of line 3)?
If I want to change the return address (by inputting anything more than 12 bytes), it would change where echo() returns to. What is the consequence of changing the old %ebp (aka the 4 bytes before the return address)? Is there any possibility of changing the return address, or where echo returns to, by just changing the old %ebp?
What is the purpose of the %ebp? I know its the frame pointer but, what is that?
Is it ever possible for the compiler to put the buffer somewhere that is not right next to where the old %ebp is stored? Like if we declare buf[8] but it stores it at -16(%ebp) instead of -12(%ebp) on line 6?
*c code and assembly copied from Computer Systems - A programmer's Perspective 2nd ed.
** Using gets() because doing buffer overflows
The reason 20 bytes are allocated is for the purpose of stack alignment. GCC 4.5+ generates code that ensures that the callee's local stack space is aligned to a 16-byte boundary, in order to ensure that compiled code can do aligned SSE loads and stores on the stack in a well-defined manner. For that reason, the compiler in this case needs to throw away some stack-space in order to ensure that gets/puts get a properly aligned frame.
In essence, this is how the stack will look, where each line is a 4-byte word except for --- lines that denote 16-byte address boundaries:
...
Saved EIP from caller
Saved EBP
---
Saved EBX # This is where echo's frame starts
buf
buf
Unused
---
Unused
Parameter to gets/puts
Saved EIP
Saved EBP
---
... # This is where gets'/puts' frame starts
As you can hopefully see from my fantastic ASCII graphics, if it weren't for the "unused" portions, gets/puts would get an unaligned frame. Do note also, however, that not all 12 of the extra bytes are unused; 4 of them are reserved for the parameter, which leaves 8 bytes of padding.
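The arithmetic behind the 20 can be sketched as follows. This is a simplified model, assuming the only goal is to make the whole frame (return address, saved %ebp, saved %ebx, plus the subl amount) a multiple of 16 bytes; sub_amount and the constants are hypothetical names, and a real compiler is free to choose a different layout.

```c
/* Bytes this frame uses before the subl: return address, saved %ebp, saved %ebx. */
enum { WORD = 4, FIXED = 3 * WORD, ALIGN = 16 };

/* Smallest subl amount >= needed that makes FIXED + amount a multiple of 16.
   "needed" is the space the function actually wants (locals + outgoing args). */
unsigned sub_amount(unsigned needed) {
    unsigned total   = FIXED + needed;                       /* unpadded frame size */
    unsigned rounded = (total + ALIGN - 1) / ALIGN * ALIGN;  /* round up to 16 */
    return rounded - FIXED;
}
```

For echo, needed is 8 bytes of buf plus 4 bytes for the argument slot, so sub_amount(12) gives 20: the unpadded frame would be 24 bytes, which rounds up to 32, and 32 minus the 12 fixed bytes is 20.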
Is it ever possible for the compiler to put the buffer somewhere that is not right next to where the old %ebp is stored? Like if we declare buf[8] but it stores it at -16(%ebp) instead of -12(%ebp) on line 6?
Certainly. The compiler is free to organize the stack however it feels like. In order to do buffer overflows predictably, you have to be looking at a specific compiled binary of a program.
As for what the purpose of EBP is (and thus to answer your questions 2, 3 and 5), please see any introductory text to how the call stack is organized, such as the Wikipedia article.

Assembly x86 - "leave" Instruction

It's said that the "leave" instruction is similar to:
movl %ebp, %esp
popl %ebp
I understand the movl %ebp, %esp part, and that it acts to release stored up memory (as discussed in this question).
But what is the purpose of the popl %ebp code?
LEAVE is the counterpart to ENTER. The ENTER instruction sets up a stack frame by first pushing EBP onto the stack and then copying ESP into EBP, so LEAVE has to do the opposite, i.e. copy EBP to ESP and then restore the old EBP from the stack.
See the section named PROCEDURE CALLS FOR BLOCK-STRUCTURED LANGUAGES in Intel's Software Developer's Manual Vol 1 if you want to read more about how ENTER and LEAVE work.
enter n,0 is exactly equivalent to (and should be replaced with)
push %ebp
mov %esp, %ebp # ebp = esp, mov ebp,esp in Intel syntax
sub $n, %esp # allocate space on the stack. Omit if n=0
leave is exactly equivalent to
mov %ebp, %esp # esp = ebp, mov esp,ebp in Intel syntax
pop %ebp
enter is very slow and compilers don't use it, but leave is fine. (http://agner.org/optimize). Compilers do use leave if they make a stack frame at all (at least gcc does). But if esp is already equal to ebp, it's most efficient to just pop ebp.
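The two expansions above can be replayed on a toy model to check that leave undoes enter. This is a sketch under stated assumptions: the Cpu struct, its 32-word stack and the function names are all hypothetical, and only the enter n,0 form is modeled.

```c
#include <stdint.h>

/* Hypothetical toy CPU: two registers plus a small stack indexed by esp / 4. */
typedef struct {
    uint32_t esp, ebp;
    uint32_t stack[32];
} Cpu;

static void push(Cpu *c, uint32_t v) { c->esp -= 4; c->stack[c->esp / 4] = v; }
static uint32_t pop(Cpu *c) { uint32_t v = c->stack[c->esp / 4]; c->esp += 4; return v; }

/* enter n,0: push %ebp; mov %esp,%ebp; sub $n,%esp */
void enter_frame(Cpu *c, uint32_t n) {
    push(c, c->ebp);
    c->ebp = c->esp;
    c->esp -= n;
}

/* leave: mov %ebp,%esp; pop %ebp */
void leave_frame(Cpu *c) {
    c->esp = c->ebp;     /* throw away the locals */
    c->ebp = pop(c);     /* restore the caller's frame pointer */
}
```

Starting with esp = 128 and ebp = 0xDEADBEEF, enter_frame(&c, 24) leaves ebp = 124 and esp = 100, and leave_frame(&c) brings both registers back to their starting values. Without the pop, esp would be off by 4 and the caller's ebp would be lost.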
The popl instruction restores the base pointer, and the movl instruction restores the stack pointer. The base pointer is the bottom of the stack, and the stack pointer is the top. Before the leave instruction, the stack looks like this:
----Bottom of Caller's stack----
...
Caller's
Variables
...
Function Parameters
----Top of Caller's Stack/Bottom of Callee's stack---- (%ebp)
...
Callee's
Variables
...
----Top of Callee's stack---- (%esp)
After the movl %ebp, %esp, which deallocates the callee's stack, the stack looks like this:
----Bottom of Caller's stack----
...
Caller's
Variables
...
Function Parameters
----Top of Caller's Stack/Bottom of Callee's stack---- (%ebp) and (%esp)
After the popl %ebp, which restores the caller's stack, the stack looks like this:
----Bottom of Caller's stack---- (%ebp)
...
Caller's
Variables
...
Function Parameters
----Top of Caller's Stack---- (%esp)
The enter instruction saves the bottom of the caller's stack and sets the base pointer so that the callee can allocate their stack.
Also note that, while most C compilers set up the stack this way (at least with optimization turned off), if you write an assembly language function you can just keep using the caller's stack frame if you want to. You do have to pop everything off the stack that you push onto it, or else you'll jump to a junk address when you return. This is because call <somewhere> means push <ret address> (in effect push %eip) followed by jmp <somewhere>, and ret means jump to the address on the top of the stack (in effect pop %eip). %eip is the register that holds the address of the next instruction to execute.

Understanding pre/post assembly code for a function call in x86 IA32 assembly

So we have the following code, setting up for a function call with its arguments, its main body omitted (etc etc etc), and then the popping at the end of the function.
pushl %ebp
movl %esp, %ebp
pushl %ebx
movl 8(%ebp), %ebx
movl 12(%ebp), %ecx
etc
etc
etc
//end of function
popl %ebx
popl %ebp
Here's what I (think) I understand.
Suppose we have %esp pointing to memory address 100.
pushl %ebp
So this essentially makes %ebp point to where %esp points (memory address 100) + 4. So now %ebp points to memory address 104. This leaves our current memory state looking like so:
----------
|100|%esp
|104|%ebp
----------
Then we have the next line of code:
movl %esp, %ebp
So from what I understand, ebp now points to memory address 100. I have a little intuition as to why we do this step, but my confusion is the next line:
pushl %ebx
What is the purpose of pushing ebx, which I assume will then point to memory address 104? I have a vague idea of how the space right below ebp (104) is supposed to be a reference to an "old stack pointer," so I can see why the next 2 lines add 8 and 12 to ebp to be the "arguments" of our function, rather than 4 and 8.
But I'm confused as to why we push ebx onto the stack, first.
I also do not understand popping, and why we pop ebx and ebp?
Talking to someone about this before he had to sleep, he mentioned that we have no reference to the fact that our stack pointer was at 100 -- until we pop ebp back. Now, I thought ebp's value was 100, so I don't understand the point he was trying to make.
So to clarify:
Is my understanding thus far correct?
Why do we push ebx onto the stack?
What is this "reference to the old stack pointer" that lives right below ebp? Is that the ebx that we push?
Is there something I'm not understanding, like some sort of difference between the ebx that we push, and the ebx in the line right after (our argument)? Is there a difference between the ebp that gets pushed and the ebp in the line right after?
Why are we popping at the end?
I apologize if this is difficult to understand. I understand similar questions have been asked about this, but I'm trying to intuitively understand and picture what exactly is going on in a function call in a way that makes sense to me.
Note: I edited some important things regarding my understanding of what's going on, particularly with regards to ebp.
As Joachim stated in a comment on your question, pushing a register pushes the contents of the register at that moment onto the stack; it doesn’t push a reference to the register or anything else. I’m not sure if you were saying that’s what was happening, but otherwise this diagram was unclear:
----------
|100|%esp
|104|%ebp
----------
Nevertheless, I’ll try to explain what it does and why.
Say %esp was 0x100 when the caller calls our function and the instruction after the call is at 0x200. When we execute call, we push 0x200 (the return address) and jump to the procedure. Our stack is then:
           Address  Value
%esp -->   0x100    0x200
And %ebp is some value or another; it might point into the stack or it might not. It doesn’t even need to represent an address. So %ebp is meaningless to us at this point.
But though it’s meaningless to us, the caller does expect it to stay the same before and after the call, so we have to preserve it. Let’s say it contained the value 0xDEADBEEF. We push it, so the stack now looks like this:
           Address  Value
           0x100    0x200
%esp -->   0x0fc    0xDEADBEEF
In most situations, we can address everything as an offset from %esp, and that applies here, too. But if the compiler is compiling some C code that deals with variable-length arrays or other features, we often will want to index from the first thing we pushed rather than the last thing we pushed. To do that, we’ll set %ebp to where we are right now. Then things look like this:
                 Address  Value
                 0x100    0x200
%esp, %ebp -->   0x0fc    0xDEADBEEF
Note that the value at the address pointed to by %ebp is the old value of %ebp, so you can walk the stack, as you mentioned you were aware of before.
Next, we push %ebx, which we’ll say has the value 0xBEEFCAFE. This is the first thing not directly related to a function prologue. Then our stack looks like this:
           Address  Value
           0x100    0x200
%ebp -->   0x0fc    0xDEADBEEF
%esp -->   0x0f8    0xBEEFCAFE
But why do we push %ebx? Well, as it turns out, the x86 C calling convention dictates that, like %ebp, %ebx must stay the same as it was before the call. So because the code you omitted presumably changes %ebx, it has to preserve the initial value so it can restore it for the caller.
After we’ve restored %ebx, we pop %ebp, restoring its value as well, since that, too, must be preserved after the call. And finally we return.
TL;DR: %ebp and %ebx are pushed and popped because they are manipulated during the execution of the body of the function, but the x86 C calling convention dictates that the values must remain the same before and after the call, so the initial values must be preserved so we can restore them.
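The walkthrough above can be replayed in a small simulation, using the same example values (0x100, 0xDEADBEEF, 0xBEEFCAFE). This is a sketch only: the Regs struct, the 64-word memory and the function names are hypothetical, and the omitted function body is stood in for by a single assignment that clobbers ebx.

```c
#include <stdint.h>

/* Hypothetical register file plus a small memory indexed by address / 4. */
typedef struct {
    uint32_t esp, ebp, ebx;
    uint32_t mem[64];
} Regs;

static void push(Regs *r, uint32_t v) { r->esp -= 4; r->mem[r->esp / 4] = v; }
static uint32_t pop(Regs *r) { uint32_t v = r->mem[r->esp / 4]; r->esp += 4; return v; }

/* Simulated callee; returns the value of %esp seen while its body runs. */
uint32_t callee(Regs *r) {
    push(r, r->ebp);             /* pushl %ebp: save the caller's value */
    r->ebp = r->esp;             /* movl %esp, %ebp */
    push(r, r->ebx);             /* pushl %ebx: save the caller's value */
    r->ebx = 12345;              /* body is free to clobber %ebx now */
    uint32_t inside = r->esp;
    r->ebx = pop(r);             /* popl %ebx: restore in reverse order */
    r->ebp = pop(r);             /* popl %ebp */
    return inside;
}
```

With esp = 0x100, ebp = 0xDEADBEEF and ebx = 0xBEEFCAFE on entry, esp is 0xf8 inside the body, and all three registers hold their original values again once callee returns, which is what the caller relies on.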
pushl %ebp
Save the value of ebp on the stack. Any push command affects the value of %esp.
movl %esp, %ebp
Move the current value of esp into ebp. This sets the stack frame, you can now find function arguments above ebp (as the stack grows down).
pushl %ebx
Save the value of ebx. It is a callee-saved register under the ABI, so the function must preserve it before using it.
movl 8(%ebp), %ebx
Move the memory ebp+8 into ebx. As previously stated, since the stack grows down this is one of the function arguments.
movl 12(%ebp), %ecx
Similar to the previous instruction, this moves another function argument into ecx.
popl %ebx
Restore the value of ebx we saved on the stack earlier.
popl %ebp
And restore the value of ebp. At this point, there is a matching pop for every push, so esp is back to what it was on function entry and we can return.

Help translating from assembly to C

I have some code from a function
subl $24, %esp
movl 8(%ebp), %eax
cmpl 12(%ebp), %eax
Before the code is just the 'ENTER' command and afterwards there's an if statement to return 1 if ebp > eax or 0 if it's less. I'm assuming cmpl means compare, but I can't tell what the concrete values are. Can anyone tell me what's happening?
Yes cmpl means compare (with 4-byte arguments). Suppose the piece of code is followed by a jg <addr>:
movl 8(%ebp), %eax
cmpl 12(%ebp), %eax
jg <addr>
Then the code is similar to
eax = ebp[8];
if (eax > ebp[12])
goto <addr>;
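In plain C, the comparison plus conditional jump corresponds to an ordinary relational test. The sketch below assumes both arguments are int; the asker's real function is not shown, so the names are hypothetical. jg tests the flags as a signed comparison, whereas its unsigned counterpart ja would correspond to comparing unsigned values.

```c
/* cmpl b, a followed by jg: signed "a > b". */
int gt_signed(int a, int b) {
    return a > b;
}

/* The same cmpl followed by ja instead: unsigned "a > b". */
int gt_unsigned(unsigned a, unsigned b) {
    return a > b;
}
```

The two differ only when the sign bit is involved: gt_signed(-1, 1) is 0, while the same bit pattern passed as gt_unsigned((unsigned)-1, 1u) is 1.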
Your code fragment resembles the entry code used by some processors and compilers. The entry code is assembly code that a compiler issues when entering a function.
Entry code is responsible for saving function parameters and allocating space for local variables and optionally initializing them. The entry code uses pointers to the storage area of the variables. Some processors use a combination of the EBP and ESP registers to point to the location of the local variables (and function parameters).
Since the compiler knows where the variables (and function parameters) are stored, it drops the variable names and uses numerical indexing. For example, the line:
movl 8(%ebp), %eax
would move the 4-byte value located 8 bytes above the address in EBP into the register EAX. With a standard frame (EBP set up by the entry code), positive offsets from EBP reach the function parameters, while local variables sit at negative offsets.
The instruction:
subl $24, %esp
implies that the compiler is reserving 24 bytes on the stack. The function can use this area for its local variables, for temporaries, and for the arguments of any functions it calls.
The code fragment you supplied looks like it is comparing the first two parameters of a function (positive offsets from EBP are parameters; locals sit at negative offsets):
void Unknown_Function(long param1, long param2, long param3)
{
unsigned int local_variable_1;
unsigned int local_variable_2;
unsigned int local_variable_3;
if (param1 > param2)
{
//...
}
}
Try disassembling the above function and see how close it matches your code fragment.
This is a comparison between the values stored at (EBP + 8) and (EBP + 12). Based on the comparison result, the cmpl instruction sets flags that are used by the jump instructions that follow.
In the Mac OS X 32-bit ABI, EBP + 8 is the first function parameter, and EBP + 12 is the second parameter.

Resources