Translating Assembly to C program

Translating Assembly to C program - c

Here is the assembly code I'm looking at:
push %ebp
mov %esp, %epb
sub $0x18, %esp
movl $0x804a170, 0x4(%esp)
mov 0x8(%ebp), %eax
mov %eax, (%esp)
call 0x8048f9b <strings_not_equal>
test %eax, %eax
je 0x8048f78 <phase_1+34>
call 0x8049198 <failed>
leave
ret
The goal is to give the function of the assembly the correct input as to jump to the function phase_1+34. So here is what I am interpreting so far:
The first two lines are set up code for the function. The first 'sub' line is allocating space by moving '%esp' downward to store arguments to call the strings_not_equal function. I think that the two arguments being passed to strings_not_equal are 0x804a170 and the input value. I assume that <strings_not_equal> will return 0 if the passed strings are equal. The je then checks to see if %eax & %eax is zero, which will only happen if %eax = 0. So basically, it seems that the input string just has to be equal to 0x804a170.
Anyone see any flaws so far?
At this point, I'm stuck, and what I have tried isn't working. 0x804a170 in decimal is 134521200. But the function this is being passed to is expecting strings. So 0x804a170 should be converted to a string? and that string is what the input should be equal to?
I dunno. If anyone sees any flaws or can give me a pointer it is very much appreciated!

0x804a170 is a pointer to something considered a "string" by the runtime. Since you give no information about what it is, we can only guess it's either a pointer to the first character, followed by more characters and ending with a 0, or it's a pointer to the first character, and the previous 1/2/4/8 bytes describe the length as an integer.
It's not a string itself.

Commented the code:
push %ebp ;function name is apparently phase_1
mov %esp, %epb
sub $0x18, %esp ;allocate 24 bytes from stack
movl $0x804a170, 0x4(%esp) ;[esp+4] = pointer to literal, static, or global string
mov 0x8(%ebp), %eax ;[esp ] = [ebp + 8] (this and next instruction)
mov %eax, (%esp)
call 0x8048f9b <strings_not_equal>
test %eax, %eax
je 0x8048f78 <phase_1+34> ;probably branches to next instruction after ret
call 0x8049198 <failed>
leave
ret
... ;padding here, nop (0x90) or int 3 (0xcc) are common
phase_1 + 34 ... ;continuation of phase_1

Related

Segmentation fault when calling x86 Assembly function from C program

I am writing a C program that calls an x86 Assembly function which adds two numbers. Below are the contents of my C program (CallAssemblyFromC.c):
#include <stdio.h>
#include <stdlib.h>
int addition(int a, int b);
int main(void) {
int sum = addition(3, 4);
printf("%d", sum);
return EXIT_SUCCESS;
}
Below is the code of the Assembly function (my idea is to code from scratch the stack frame prologue and epilogue, I have added comments to explain the logic of my code) (addition.s):
.text
# Here, we define a function addition
.global addition
addition:
# Prologue:
# Push the current EBP (base pointer) to the stack, so that we
# can reset the EBP to its original state after the function's
# execution
push %ebp
# Move the EBP (base pointer) to the current position of the ESP
# register
movl %esp, %ebp
# Read in the parameters of the addition function
# addition(a, b)
#
# Since we are pushing to the stack, we need to obtain the parameters
# in reverse order:
# EBP (return address) | EBP + 4 (return value) | EBP + 8 (b) | EBP + 4 (a)
#
# Utilize advanced indexing in order to obtain the parameters, and
# store them in the CPU's registers
movzbl 8(%ebp), %ebx
movzbl 12(%ebp), %ecx
# Clear the EAX register to store the sum
xorl %eax, %eax
# Add the values into the section of memory storing the return value
addl %ebx, %eax
addl %ecx, %eax
I am getting a segmentation fault error, which seems strange considering that I think I am allocating memory in accordance with the x86 calling conventions (e.x. allocating the correct memory sections to the function's parameters). Furthermore, if any of you have a solution, it would be greatly appreciated if you could provide some advice as to how to debug an Assembly program embedded with C (I have been using the GDB debugger but it simply points to the line of the C program where the segmentation fault happens instead of the line in the Assembly program).

Your function has no epilogue. You need to restore %ebp and pop the stack back to where it was, and then ret. If that's really missing from your code, then that explains your segfault: the CPU will go on executing whatever garbage happens to be after the end of your code in memory.
You clobber (i.e. overwrite) the %ebx register which is supposed to be callee-saved. (You mention following the x86 calling conventions, but you seem to have missed that detail.) That would be the cause of your next segfault, after you fixed the first one. If you use %ebx, you need to save and restore it, e.g. with push %ebx after your prologue and pop %ebx before your epilogue. But in this case it is better to rewrite your code so as not to use it at all; see below.
movzbl loads an 8-bit value from memory and zero-extends it into a 32-bit register. Here the parameters are int so they are already 32 bits, so plain movl is correct. As it stands your function would give incorrect results for any arguments which are negative or larger than 255.
You're using an unnecessary number of registers. You could move the first operand for the addition directly into %eax rather than putting it into %ebx and adding it to zero. And on x86 it is not necessary to get both operands into registers before adding; arithmetic instructions have a mem, reg form where one operand can be loaded directly from memory. With this approach we don't need any registers other than %eax itself, and in particular we don't have to worry about %ebx anymore.
I would write:
.text
# Here, we define a function addition
.global addition
addition:
# Prologue:
push %ebp
movl %esp, %ebp
# load first argument
movl 8(%ebp), %eax
# add second argument
addl 12(%ebp), %eax
# epilogue
movl %ebp, %esp # redundant since we haven't touched esp, but will be needed in more complex functions
pop %ebp
ret
In fact, you don't need a stack frame for this function at all, though I understand if you want to include it for educational value. But if you omit it, the function can be reduced to
.text
.global addition
addition:
movl 4(%esp), %eax
addl 8(%esp), %eax
ret

You are corrupting the stacke here:
movb %al, 4(%ebp)
To return the value, simply put it in eax. Also why do you need to clear eax? that's inefficient as you can load the first value directly into eax and then add to it.
Also EBX must be saved if you intend to use it, but you don't really need it anyway.

Using "lea ($30, %edx), %eax to add edx and 30 and put it into eax

I'm studying for a midterm tomorrow and one of the questions on a previous midterm is:
Consider the following C function. Write the corresponding assembly language function to perform the same operation.
int myFunction (int a)
{
return (a + 30);
}
What I wrote down is:
.global _myFunction
_myFunction:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
lea ($30, %edx), %eax
leave
ret
where a is edx and a+30 would be eax. Is the use of lea correct in this case? Would it instead need to be
lea ($30, %edx, 1), %eax
Thanks.

If you are looking to simply add 30 using leal then you should do it this way:
leal 30(%edx), %eax
The notation is displacement(baseregister, offsetregister, scalarmultiplier). The displacement is placed on the outside. 30 is added to edx and stored in eax. In AT&T/GAS notation you can leave off both the offset and multiplier. In our example this leaves us with the equivalent of base + displacement or edx + 30 in this example.
cHao also brings up a good point. Let us say the professor asks you to optimize your code. There are some inefficiencies in the fact that myFunction uses no local variables and doesn't need stack space for itself. Because of that all the stack frame creation and destruction can be removed. If you remove the stack frame then you no longer push %ebp as well. That means your first parameter int a is at 4(%esp) . With that in mind your function can be reduced to something this simple:
.global _myFunction
_myFunction:
movl 4(%esp), %eax
addl $30, %eax
ret
Of course the moment you change your function so that it needs to store things on the stack you would have to put the stack frame code back in (pushl %ebp, pushl %ebp, leave etc)

C Code represented as Assembler Code - How to interpret?

I got this short C Code.
#include <stdint.h>
uint64_t multiply(uint32_t x, uint32_t y) {
uint64_t res;
res = x*y;
return res;
}
int main() {
uint32_t a = 3, b = 5, z;
z = multiply(a,b);
return 0;
}
There is also an Assembler Code for the given C code above.
I don't understand everything of that assembler code. I commented each line and you will find my question in the comments for each line.
The Assembler Code is:
.text
multiply:
pushl %ebp // stores the stack frame of the calling function on the stack
movl %esp, %ebp // takes the current stack pointer and uses it as the frame for the called function
subl $16, %esp // it leaves room on the stack, but why 16Bytes. sizeof(res) = 8Bytes
movl 8(%ebp), %eax // I don't know quite what "8(%ebp) mean? It has to do something with res, because
imull 12(%ebp), %eax // here is the multiplication done. And again "12(%ebp).
movl %eax, -8(%ebp) // Now, we got a negative number in front of. How to interpret this?
movl $0, -4(%ebp) // here as well
movl -8(%ebp), %eax // and here again.
movl -4(%ebp), %edx // also here
leave
ret
main:
pushl %ebp // stores the stack frame of the calling function on the stack
movl %esp, %ebp // // takes the current stack pointer and uses it as the frame for the called function
andl $-8, %esp // what happens here and why?
subl $24, %esp // here, it leaves room for local variables, but why 24 bytes? a, b, c: the size of each of them is 4 Bytes. So 3*4 = 12
movl $3, 20(%esp) // 3 gets pushed on the stack
movl $5, 16(%esp) // 5 also get pushed on the stack
movl 16(%esp), %eax // what does 16(%esp) mean and what happened with z?
movl %eax, 4(%esp) // we got the here as well
movl 20(%esp), %eax // and also here
movl %eax, (%esp) // what does happen in this line?
call multiply // thats clear, the function multiply gets called
movl %eax, 12(%esp) // it looks like the same as two lines before, except it contains the number 12
movl $0, %eax // I suppose, this line is because of "return 0;"
leave
ret

Negative references relative to %ebp are for local variables on the stack.
movl 8(%ebp), %eax // I don't know quite what "8(%ebp) mean? It has to do something with res, because`
%eax = x
imull 12(%ebp), %eax // here is the multiplication done. And again "12(%ebp).
%eax = %eax * y
movl %eax, -8(%ebp) // Now, we got a negative number in front of. How to interpret this?
(u_int32_t)res = %eax // sets low 32 bits of res
movl $0, -4(%ebp) // here as well
clears upper 32 bits of res to extend 32-bit multiplication result to uint64_t
movl -8(%ebp), %eax // and here again.
movl -4(%ebp), %edx // also here
return ret; //64-bit results are returned as a pair of 32-bit registers %edx:%eax
As for the main, see x86 calling convention which may help making sense of what happens.
andl $-8, %esp // what happens here and why?
stack boundary is aligned by 8. I believe it's ABI requirement
subl $24, %esp // here, it leaves room for local variables, but why 24 bytes? a, b, c: the size of each of them is 4 Bytes. So 3*4 = 12
Multiples of 8 (probably due to alignment requirements)
movl $3, 20(%esp) // 3 gets pushed on the stack
a = 3
movl $5, 16(%esp) // 5 also get pushed on the stack
b = 5
movl 16(%esp), %eax // what does 16(%esp) mean and what happened with z?
%eax = b
z is at 12(%esp) and is not used yet.
movl %eax, 4(%esp) // we got the here as well
put b on the stack (second argument to multiply())
movl 20(%esp), %eax // and also here
%eax = a
movl %eax, (%esp) // what does happen in this line?
put a on the stack (first argument to multiply())
call multiply // thats clear, the function multiply gets called
multiply returns 64-bit result in %edx:%eax
movl %eax, 12(%esp) // it looks like the same as two lines before, except it contains the number 12
z = (uint32_t) multiply()
movl $0, %eax // I suppose, this line is because of "return 0;"
yup. return 0;

Arguments are pushed onto the stack when the function is called. Inside the function, the stack pointer at that time is saved as the base pointer. (You got that much already.) The base pointer is used as a fixed location from which to reference arguments (which are above it, hence the positive offsets) and local variables (which are below it, hence the negative offsets).
The advantage of using a base pointer is that it is stable throughout the entire function, even when the stack pointer changes (due to function calls and new scopes).
So 8(%ebp) is one argument, and 12(%ebp) is the other.
The code is likely using more space on the stack than it needs to, because it is using temporary variables that could be optimized out of you had optimization turned on.
You might find this helpful: http://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames

I started typing this as a comment but it was getting too long to fit.
You can compile your example with -masm=intel so the assembly is more readable. Also, don't confuse the push and pop instructions with mov. push and pop always increments and decrements esp respectively before derefing the address whereas mov does not.
There are two ways to store values onto the stack. You can either push each item onto it one item at a time or you can allocate up-front the space required and then load each value onto the stackslot using mov + relative offset from either esp or ebp.
In your example, gcc chose the second method since that's usually faster because, unlike the first method, you're not constantly incrementing esp before saving the value onto the stack.
To address your other question in comment, x86 instruction set does not have a mov instruction for copying values from memory location a to another memory location b directly. It is not uncommon to see code like:
mov eax, [esp+16]
mov [esp+4], eax
mov eax, [esp+20]
mov [esp], eax
call multiply(unsigned int, unsigned int)
mov [esp+12], eax
Register eax is being used as an intermediate temporary variable to help copy data between the two stack locations. You can mentally translate the above as:
esp[4] = esp[16]; // argument 2
esp[0] = esp[20]; // argument 1
call multiply
esp[12] = eax; // eax has return value
Here's what the stack approximately looks like right before the call to multiply:
lower addr esp => uint32_t:a_copy = 3 <--. arg1 to 'multiply'
esp + 4 uint32_t:b_copy = 5 <--. arg2 to 'multiply'
^ esp + 8 ????
^ esp + 12 uint32_t:z = ? <--.
| esp + 16 uint32_t:b = 5 | local variables in 'main'
| esp + 20 uint32_t:a = 3 <--.
| ...
| ...
higher addr ebp previous frame

How to interpret this IA32 assembly code [duplicate]

This question already has an answer here:
Pointers in Assembly
(1 answer)
Closed 9 years ago.
As an assignment, I must read the assembly code and write its high-level C program. The structure in this case is a switch statement, so for each case, the assembly code is translated to the case in C code. Below will only be one of the cases. If you can help me interpret this, it should give me a better understanding of the rest of the cases.
p1 is in %ebp+8
p2 is in %ebp+12
action is in %ebp+16
result is in %edx
...
.L13:
movl 8(%ebp), %eax # get p1
movl (%eax), %edx # result = *p1?
movl 12(%ebp), %ecx # get p2
movl (%ecx), %eax # p1 = *p2
movl 8(%ebp), %ecx # p2 = p1
movl %eax, (%ecx) # *p1 = *p2?
jmp .L19 #jump to default
...
.L19
movl %edx, %eax # set return value
Of course, the comments were added by me to try to make sense of it, except it leaves me more confused. Is this meant to be a swap? Probably not; the formatting would be different. What really happens in the 2nd and 6th lines? Why is %edx only changed once so early if it's the return value? Please answer with some guidelines to interpreting this code.

The above snippet is x86_32 assembly in the (IMO broken) AT&T syntax.
AT&T syntax attaches a size suffix to every operand.
movl means a 32 bit operand. (l for long)
movw means 16 bit operand (w for word)
movb means an 8 bit operand (b for byte)
The operands are reversed in order, so the destination is on the right and the source is on the left.
This is opposite to almost every other programming language.
Register names are prefixed by a % to distinguish them from variable names.
If a register is surrounded by brackets () that means that the memory address pointed to by the register is used, rather than the value inside the register itself.
This makes sense, because EBP is used as a pointer to a stackframe.
Stackframes are used to access parameters and local variables in functions.
Instead of writing: mov eax, dword ptr [ebp+8] (Intel syntax)
AT&T syntax lists it as: movl 8(%ebp), %eax (gas syntax)
Which means: put the contends of the memory address pointed to by (ebp + 8) into eax.
Here's the translation:
.L13: <<-- label used as a jump target.
movl 8(%ebp), %eax <<-- p1, stored at ebp+8 goes into EAX
movl (%eax), %edx <<-- p1 is a pointer, EDX = p1->next
movl 12(%ebp), %ecx <<-- p2, stored at ebp+12 goes in ECX
movl (%ecx), %eax <<-- p2 is (again) a pointer, EAX = p2->next
movl 8(%ebp), %ecx <<-- ECX = p1
movl %eax, (%ecx) <<-- p2->next = p1->next
jmp .L19 <<-- jump to exit
...
.L19
movl %edx, %eax <<-- EAX is always the return value
<<-- return p1->data.
In all of the many calling conventions on x86 the return value of a function is put into the EAX register. (or EAX:EDX if it's an INT64)
In prose: p1 and p2 are pointers to data, in this data pointers to pointers.
This code looks like it manipulates a linked list.
p2->next is set to p1->next.
Other than that the snippet looks incomplete because whatever was in p2->next to begin with is not worked on so there probably more code that you're not showing.
Apart from the confusing AT&T syntax it's really very simple code.
In C the code would look like:
(void *)p2->next = (void *)p1->next;
Note that the code is quite inefficient and no decent compiler (or human) would generate this code.
The following equivalent would make more sense:
mov eax,[ebp+8]
mov ecx,[ebp+12]
mov eax,[eax]
mov [ecx],eax
jmp done
More info on the difference between AT&T and Intel syntax can be found here: http://www.ibm.com/developerworks/linux/library/l-gas-nasm/index.html

Decoding equivalent assembly code of C code

Wanting to see the output of the compiler (in assembly) for some C code, I wrote a simple program in C and generated its assembly file using gcc.
The code is this:
#include <stdio.h>
int main()
{
int i = 0;
if ( i == 0 )
{
printf("testing\n");
}
return 0;
}
The generated assembly for it is here (only the main function):
_main:
pushl %ebpz
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
leave
ret
I am at an absolute loss to correlate the C code and assembly code. All that the code has to do is store 0 in a register and compare it with a constant 0 and take suitable action. But what is going on in the assembly?

Since main is special you can often get better results by doing this type of thing in another function (preferably in it's own file with no main). For example:
void foo(int x) {
if (x == 0) {
printf("testing\n");
}
}
would probably be much more clear as assembly. Doing this would also allow you to compile with optimizations and still observe the conditional behavior. If you were to compile your original program with any optimization level above 0 it would probably do away with the comparison since the compiler could go ahead and calculate the result of that. With this code part of the comparison is hidden from the compiler (in the parameter x) so the compiler can't do this optimization.
What the extra stuff actually is
_main:
pushl %ebpz
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
This is setting up a stack frame for the current function. In x86 a stack frame is the area between the stack pointer's value (SP, ESP, or RSP for 16, 32, or 64 bit) and the base pointer's value (BP, EBP, or RBP). This is supposedly where local variables live, but not really, and explicit stack frames are optional in most cases. The use of alloca and/or variable length arrays would require their use, though.
This particular stack frame construction is different than for non-main functions because it also makes sure that the stack is 16 byte aligned. The subtraction from ESP increases the stack size by more than enough to hold local variables and the andl effectively subtracts from 0 to 15 from it, making it 16 byte aligned. This alignment seems excessive except that it would force the stack to also start out cache aligned as well as word aligned.
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
I don't know what all this does. alloca increases the stack frame size by altering the value of the stack pointer.
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
I think you know what this does. If not, the movl just befrore the call is moving the address of your string into the top location of the stack so that it may be retrived by printf. It must be passed on the stack so that printf can use it's address to infer the addresses of printf's other arguments (if any, which there aren't in this case).
leave
This instruction removes the stack frame talked about earlier. It is essentially movl %ebp, %esp followed by popl %ebp. There is also an enter instruction which can be used to construct stack frames, but gcc didn't use it. When stack frames aren't explicitly used, EBP may be used as a general puropose register and instead of leave the compiler would just add the stack frame size to the stack pointer, which would decrease the stack size by the frame size.
ret
I don't need to explain this.
When you compile with optimizations
I'm sure you will recompile all fo this with different optimization levels, so I will point out something that may happen that you will probably find odd. I have observed gcc replacing printf and fprintf with puts and fputs, respectively, when the format string did not contain any % and there were no additional parameters passed. This is because (for many reasons) it is much cheaper to call puts and fputs and in the end you still get what you wanted printed.

Don't worry about the preamble/postamble - the part you're interested in is:
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
It should be pretty self-evident as to how this correlates with the original C code.

The first part is some initialization code, which does not make any sense in the case of your simple example. This code would be removed with an optimization flag.
The last part can be mapped to C code:
movl $0, -4(%ebp) // put 0 into variable i (located at -4(%ebp))
cmpl $0, -4(%ebp) // compare variable i with value 0
jne L2 // if they are not equal, skip to after the printf call
movl $LC0, (%esp) // put the address of "testing\n" at the top of the stack
call _printf // do call printf
L2:
movl $0, %eax // return 0 (calling convention: %eax has the return code)

Well, much of it is the overhead associated with the function. main() is just a function like any other, so it has to store the return address on the stack at the start, set up the return value at the end, etc.
I would recommend using GCC to generate mixed source code and assembler which will show you the assembler generated for each sourc eline.
If you want to see the C code together with the assembly it was converted to, use a command line like this:
gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst
See http://www.delorie.com/djgpp/v2faq/faq8_20.html
On linux, just use gcc. On Windows down load Cygwin http://www.cygwin.com/
Edit - see also this question Using GCC to produce readable assembly?
and http://oprofile.sourceforge.net/doc/opannotate.html

You need some knowledge about Assembly Language to understand assembly garneted by C compiler.
This tutorial might be helpful

See here more information. You can generate the assembly code with C comments for better understanding.
gcc -g -Wa,-adhls your_c_file.c > you_asm_file.s
This should help you a little.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight