Deeper function profiling/emulation

Deeper function profiling/emulation - c

Hello everyone
Merry Christmas
I need an advice I have the following code:
int main()
{
int k=5000000;
int p;
int sum=0;
for (p=0;p<k;p++)
{
sum+=p;
}
return 0;
}
When I assemble it I get
main:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $5000000, -4(%ebp)
movl $0, -12(%ebp)
movl $0, -8(%ebp)
jmp .L2
.L3:
movl -8(%ebp), %eax
addl %eax, -12(%ebp)
addl $1, -8(%ebp)
.L2:
movl -8(%ebp), %eax
cmpl -4(%ebp), %eax
jl .L3
movl $0, %eax
leave
ret
If I run it through gprof I get that main executed the most, which is quite obvious!
Yet I want to go a step further and be able to know if L2, or L3 executed the most. here it is obvious that L3 executed the most. yet is there some kind of profiler, emulator that can give me that data for an entire code?

Well, if you don't mind low-tech, you can answer any such question either by single-stepping, or by this way.
For the latter method, it doesn't matter how big or complicated your program is.

Related

Storing Local variables in Assembly

So I've been working on a problem (and before you ask, yes, it is homework, but I've been putting in faithful effort!) where I have some assembly code and want to be able to convert it (as faithfully as possible) to C.
Here is the assembly code:
A1:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $0, -4(%ebp)
jmp .L2
.L4:
movl -4(%ebp), %eax
sall $2, %eax
addl 8(%ebp), %eax
movl (%eax), %eax
cmpl 12(%ebp), %eax
jg .L6
.L2:
movl -4(%ebp), %eax
cmpl 16(%ebp), %eax
jl .L4
jmp .L3
.L6:
nop
.L3:
movl -4(%ebp), %eax
leave
ret
And here's some of the C code I wrote to mimic it:
int A1(int a, int b, int c) {
int local = 0;
while(local < c) {
if(b > (int*)((local << 2) + a)) {
return local;
}
}
return local;
}
I have a few questions about how assembly works.
First, I notice that in L4, the body of the while loop, nothing is ever assigned to local. It's initialized to be 0 at the start of the function, and then never modified again. Looking at the C code I made for it, though, that seems odd, considering that the loop will go on indefinitely if the if-condition fails. Am I missing something there? I was under the impression that you'd need a snippet of code like:
movl %eax, -4(%ebp)
in order to actually assign anything to the local variable, and I don't see anything like that in the body of the while loop.
Secondly, you'll see that in the assembly code, the only local variable that's declared is "local". Hence, I have to use a snippet of code like:
if(b > (int*)((local << 2) + a))
The output of this line doesn't look much like the assembly code, though, and I think I might have made a mistake. What did I do wrong here?
And finally (thanks for your patience!), on a related note, I understand that the purpose of this if-loop in the while loop is to break out if the condition is fulfilled, and then to return local. Hence L6 and "nop" (which is basically saying nothing). However, I don't know how to replicate this in my program. I've tried "break", and I've tried returning local as you see here. I understand the functionality - I just don't know how to replicate it in C (short of using goto, but that kind of defeats the purpose of the exercise...).
Thank you for your time!

This is my guess:
int A1 (int *a, int value, int size)
{
int i = 0;
while (i<size)
{
if (a[i] <= value)
break;
}
return i;
}
Which, compiled back to assembly, gives me this code:
A1:
.LFB0:
pushl %ebp
movl %esp, %ebp
subl $16, %esp
movl $0, -4(%ebp)
jmp .L2
.L4:
movl -4(%ebp), %eax
leal 0(,%eax,4), %edx
movl 8(%ebp), %eax
addl %edx, %eax
movl (%eax), %eax
cmpl 12(%ebp), %eax
jg .L2
jmp .L3
.L2:
movl -4(%ebp), %eax
cmpl 16(%ebp), %eax
jl .L4
.L3:
movl -4(%ebp), %eax
leave
ret
Now this seems to be identical to your original ASM code, just the code starting at L4 is not the same, but if we anotate both codes:
ORIGINAL
movl -4(%ebp), %eax ;EAX = local
sall $2, %eax ;EAX = EAX*4
addl 8(%ebp), %eax ;EAX = EAX+a, hence EAX=a+local*4
ASM-C-ASM
movl -4(%ebp), %eax ;EAX = i
leal 0(,%eax,4), %edx ;EDX = EAX*4
movl 8(%ebp), %eax ;EAX = a
addl %edx, %eax ;EAX = EAX+EDX, hence EAX=a+i*4
Both codes continue with
movl (%eax), %eax
Because of this, I guess a is actually a pointer to some variable type that uses 4 bytes. By the comparison between the second argument and the value read from memory, I guess that type must be either int or long. I choose int solely by convenience.
Of course this also means that this code (and the original one) does not make any sense. It lacks the i++ part somewhere. If this is so, then a is an array, and the third argument is the size of the array. I've named my local variable i to keep with the tradition of naming index variables like this.
This code would scan the array searching for a value inside it that is equal or less than value. If it finds it, the index to that value is returned. If not, the size of the array is returned.

I'm trying to interpret this IA32 assembly language code

I have this IA32 assembly language code I'm trying to convert into regular C code.
.globl fn
.type fn, #function
fn:
pushl %ebp #setup
movl $1, %eax #setup 1 is in A
movl %esp, %ebp #setup
movl 8(%ebp), %edx # pointer X is in D
cmpl $1, %edx # (*x > 1)
jle .L4
.L5:
imull %edx, %eax
subl $1, %edx
cmpl $1, %edx
jne .L5
.L4:
popl %ebp
ret
The trouble I'm having is deciding what type of comparison is going on. I don't get how the program gets to the L5 cache. L5 seems to be a loop since there's a comparison within it. I'm also unsure of what is being returned because it seems like most of the work is done is the %edx register, but doesn't go back to %eax for returning.
What I have so far:
int fn(int x)
{
}

It looks to me like it's computing a factorial. Ignoring the stack frame manipulation and such, we're left with:
movl $1, %eax #setup 1 is in A
Puts 1 into eax.
movl 8(%ebp), %edx # pointer X is in D
Retrieves a parameter into edx
imull %edx, %eax
Multiplies eax by edx, putting the result into eax.
subl $1, %edx
cmpl $1, %edx
jne .L5
Decrements edx and repeats if edx != 1.
In other words, this is roughly equivalent to:
unsigned fact(unsigned input) {
unsigned retval = 1;
for ( ; input != 1; --input)
retval *= input;
return retval;
}

Coding in C: efficiency of temporary local variables

I was wondering: programming in C, let's say we have two functions:
int get_a_value();
int calculate_something(int number);
And two versions of a third one:
/* version 1 */
int main()
{
int value = get_a_value();
int result = calculate_something(value);
return result;
}
/* version 2 */
int main()
{
int result = calculate_something(get_a_value());
return result;
}
Teorethically, what would be the difference between these two versions of the same thing, in terms of correctness, memory use and efficiency? Would they generate different instructions? On the other hand, what circumstances would make the possible differences significant in reality?
Thanks in advance.

Copied both versions, compiled each with gcc -S to get the machine language output, used sdiff to compare side-by-side.
Results using gcc version 4.1.2 20070115 (SUSE Linux):
No optimization:
main: main:
.LFB2: .LFB2:
pushq %rbp pushq %rbp
.LCFI0: .LCFI0:
movq %rsp, %rbp movq %rsp, %rbp
.LCFI1: .LCFI1:
subq $16, %rsp subq $16, %rsp
.LCFI2: .LCFI2:
movl $0, %eax movl $0, %eax
call get_a_value call get_a_value
movl %eax, -8(%rbp) | movl %eax, %edi
movl -8(%rbp), %edi <
movl $0, %eax movl $0, %eax
call calculate_something call calculate_something
movl %eax, -4(%rbp) movl %eax, -4(%rbp)
movl -4(%rbp), %eax movl -4(%rbp), %eax
leave leave
ret ret
Basically, one extra move instruction. Both allocate the same amount of stack space (subq $16, %rsp reserves 16 bytes for the stack), so memory-wise there's no difference.
Level 1 optimization (-O1):
main: main:
.LFB2: .LFB2:
subq $8, %rsp subq $8, %rsp
.LCFI0: .LCFI0:
movl $0, %eax movl $0, %eax
call get_a_value call get_a_value
movl %eax, %edi movl %eax, %edi
movl $0, %eax movl $0, %eax
call calculate_something call calculate_something
addq $8, %rsp addq $8, %rsp
ret ret
No differences.
Results using gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-112.7.2):
No optimization:
main: main:
pushl %ebp pushl %ebp
movl %esp, %ebp movl %esp, %ebp
subl $8, %esp subl $8, %esp
> subl $12, %esp
> subl $4, %esp
call get_a_value call get_a_value
> addl $4, %esp
movl %eax, %eax movl %eax, %eax
movl %eax, -4(%ebp) | pushl %eax
subl $12, %esp <
pushl -4(%ebp) <
call calculate_something call calculate_something
addl $16, %esp addl $16, %esp
movl %eax, %eax movl %eax, %eax
movl %eax, -8(%ebp) | movl %eax, -4(%ebp)
movl -8(%ebp), %eax | movl -4(%ebp), %eax
movl %eax, %eax movl %eax, %eax
leave leave
ret ret
Roughly the same number of instructions, ordered slightly differently.
Level 1 optimization (-O1):
main: main:
pushl %ebp pushl %ebp
movl %esp, %ebp movl %esp, %ebp
subl $8, %esp | subl $24, %esp
call get_a_value call get_a_value
subl $12, %esp | movl %eax, (%esp)
pushl %eax <
call calculate_something call calculate_something
leave leave
ret ret
Looks like the second version reserves a little more stack space.
So, for this particular example with these particular compilers, there's no huge difference between the two versions. In that case, I'd favor the first version for the following reasons:
Easier to trace in a debugger; you can examine the value returned from get_a_value before passing it to calculate_something;
It gives you a place to do some sanity checking, in case calculate_something isn't well-behaved for certain inputs;
It's a little easier on the eyes.
Just remember that terse doesn't necessarily mean fast or efficient, and what's fast/efficient under one particular compiler/hardware combination may be hopelessly busted under a different compiler/hardware combination. Some compilers actually have an easier time optimizing code that's written in a clear manner.
Your code should be, in order:
Correct - it doesn't matter how fast it is or how little memory it uses if it doesn't meet its requirements;
Secure - it doesn't matter how fast it is or how little memory it uses if it's a malware vector or risks exposing sensitive data to unauthorized parties (yes, I'm talking about Heart-frickin'-bleed);
Robust - it doesn't mattter how fast it is or how little memory it uses if it dumps core because somebody sneezed in a different room;
Maintainable - it doesn't matter how fast it is or how little memory it uses if it has to be scrapped and rewritten because the requirements changed (which they do);
Efficient - now you can start worrying about performance and efficiency.

I performed a little test, generating assembler code for the 2 versions. Simply running a diff command from bash showed that the first version has 2 instructions more than the second one.
If you want to try by yourself simply compile with this command
gcc -S main.c -o asmout.s
gcc -S main2.c -o asmout2.s
and then check differences with
diff asmout.s asmout2.s
I got these 2 instructions more for the first one:
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
EDIT:
As Keith Thompson suggested if compiled with optimization options the generated assembler code is the same for both versions.

It really depends on the platform and the compiler, but with optimization on they should usually generate the same code. At worst version one will allocate space for an extra int. If placing the value of get_a_value in a variable makes your code more readable then I would go ahead and do that. The only time I would advise not doing so is in a deeply recursive function.

Why does GCC produce the following asm output?

I don't understand why gcc -S -m32 produces these particular lines of code:
movl %eax, 28(%esp)
movl $desc, 4(%esp)
movl 28(%esp), %eax
movl %eax, (%esp)
call sort_gen_asm
My question is why %eax is pushed and then popped? And why movl used instead of pushl and popl respectively? Is it faster? Is there some coding convention I don't yet know? I've just started looking at asm-output closely, so I don't know much.
The C code:
void print_array(int *data, size_t sz);
void sort_gen_asm(array_t*, comparer_t);
int main(int argc, char *argv[]) {
FILE *file;
array_t *array;
file = fopen("test", "rb");
if (file == NULL) {
err(EXIT_FAILURE, NULL);
}
array = array_get(file);
sort_gen_asm(array, desc);
print_array(array->data, array->sz);
array_destroy(array);
fclose(file);
return 0;
}
It gives this output:
.file "main.c"
.section .rodata
.LC0:
.string "rb"
.LC1:
.string "test"
.text
.globl main
.type main, #function
main:
.LFB2:
.cfi_startproc
pushl %ebp
.cfi_def_cfa_offset 8
.cfi_offset 5, -8
movl %esp, %ebp
.cfi_def_cfa_register 5
andl $-16, %esp
subl $32, %esp
movl $.LC0, 4(%esp)
movl $.LC1, (%esp)
call fopen
movl %eax, 24(%esp)
cmpl $0, 24(%esp)
jne .L2
movl $0, 4(%esp)
movl $1, (%esp)
call err
.L2:
movl 24(%esp), %eax
movl %eax, (%esp)
call array_get
movl %eax, 28(%esp)
movl $desc, 4(%esp)
movl 28(%esp), %eax
movl %eax, (%esp)
call sort_gen_asm
movl 28(%esp), %eax
movl 4(%eax), %edx
movl 28(%esp), %eax
movl (%eax), %eax
movl %edx, 4(%esp)
movl %eax, (%esp)
call print_array
movl 28(%esp), %eax
movl %eax, (%esp)
call array_destroy
movl 24(%esp), %eax
movl %eax, (%esp)
call fclose
movl $0, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
.LFE2:
.size main, .-main
.ident "GCC: (Ubuntu/Linaro 4.8.1-10ubuntu8) 4.8.1"
.section .note.GNU-stack,"",#progbits

The save / load of eax is because you did not compile with optimizations. So any read/write of a variable will emit a read/write of a memory address.
Actually, for (almost) any line of code you will be able to identify the exact piece of assembler code resulting from it (let me advise you to compile with gcc -g -c -O0 and then objdump -S file.o):
#array = array_get(file);
call array_get
movl %eax, 28(%esp) #write array
#sort_gen_asm(array, desc);
movl 28(%esp), %eax #read array
movl %eax, (%esp)
...
About not pushing/poping, it is a standard zero-cost optimization. Instead of push/pop every time you want to call a function you just substract the maximum needed space to esp at the beginning of the function and then save your function arguments at the bottom of the empty space. There are a lot of advantages: faster code (no changing esp), it doesn't need to compute the argument in any particular order, and the esp will need to be substracted anyway for the local variables space.

Some things have to do with calling conventions. Others with optimisations.
sort_gen_asm seems to use cdecl calling convention which requires it's arguments to be pushed onto the stack in reverse order. thus:
movl $desc, 4(%esp)
movl %eax, (%esp)
The other moves are partially unoptimised compiler routines:
movl %eax, 28(%esp) # save contents of %eax on the stack before calling
movl 28(%esp), %eax # retrieve saved 28(%esp) in order to prepare it as an argument
# Unoptimised compiler seems to have forgotten that it's
# still in the register

i have a question about stack frame structure?

i'm a new one to learn assembly. i write a c file:
#include <stdlib.h>
int max( int c )
{
int d;
d = c + 1;
return d;
}
int main( void )
{
int a = 0;
int b;
b = max( a );
return 0;
}
and i use gcc -S as01.c and create a assembly file.
.file "as01.c"
.text
.globl max
.type max, #function
max:
pushl %ebp
movl %esp, %ebp
subl $32, %esp
movl $0, -4(%ebp)
movl $1, -24(%ebp)
movl $2, -20(%ebp)
movl $3, -16(%ebp)
movl $4, -12(%ebp)
movl $6, -8(%ebp)
movl 8(%ebp), %eax
addl $1, %eax
movl %eax, -4(%ebp)
movl -4(%ebp), %eax
leave
ret
.size max, .-max
.globl main
.type main, #function
main:
pushl %ebp
movl %esp, %ebp
subl $20, %esp
movl $0, -4(%ebp)
movl -4(%ebp), %eax
movl %eax, (%esp)
call max
movl %eax, -8(%ebp)
"as01.s" 38L, 638C
i' confused, beacuse movl %eax, -4(%ebp) movl -4(%ebp), %eax in max(),
i know that %eax is used for returning the value of any function.
I think %eax is a temporarily register for store the c + 1.
This is right?
thank you for your answer.

You don't have optimisation turned on, so the compiler is generating really bad code. The primary storage for all your values is in the stack frame, and values are loaded into registers only long enough to do the calculations.
The code actually breaks down into:
pushl %ebp
movl %esp, %ebp
subl $32, %esp
Standard function prologue, setting up a new stack frame, and reserving 50 bytes for the stack frame.
movl $0, -4(%ebp)
movl $1, -24(%ebp)
movl $2, -20(%ebp)
movl $3, -16(%ebp)
movl $4, -12(%ebp)
movl $6, -8(%ebp)
Fill the stack frame with dummy values (presumably as a debugging aid).
movl 8(%ebp), %eax
addl $1, %eax
movl %eax, -4(%ebp)
Read the parameter c out of the stack frame, add one to it, store it into a (different) stack slot.
movl -4(%ebp), %eax
leave
ret
Read the value back out of the stack slot and return it.
If you compile this with optimisation, you'll see most of the code vanish. If you use -fomit-frame-pointer -Os, you should end up with this:
max:
movl 4(%esp), %eax
incl %eax
ret

movl %eax, -4(%ebp)
Here the value computed for d (now stored in eax) is saved in d's memory cell.
movl -4(%ebp), %eax
While here the return value (d's) gets loaded into eax, because, as you know, eax holds functions' return value.
As #David said, you're compiling without optimization, so gcc generates easy-to-debug code, which is quite inefficient and repetitive sometimes.