Reading x86 assembly code

Reading x86 assembly code - c

I am working through a lab where I have to defuse a "bomb" by providing the correct input for each phase. I do not have access to the source code, so I have to step through the assembly code with GDB. Right now, I'm stuck on phase 2 and would really appreciate some help. Here is the x86 assembly code - I've added some comments that describe what I think is happening, but these could be horribly wrong because we only started learning assembly code a few days ago and I'm still quite shaky. As far as I can tell right now, this phase reads in 6 numbers from the user (that's what read_six_numbers does) and seems to go through some type of loop.
0000000000400f03 <phase_2>:
400f03: 41 55 push %r13 // save values
400f05: 41 54 push %r12
400f07: 55 push %rbp
400f08: 53 push %rbx
400f09: 48 83 ec 28 sub $0x28,%rsp // decrease stack pointer
400f0d: 48 89 e6 mov %rsp,%rsi // move rsp to rsi
400f10: e8 5a 07 00 00 callq 40166f <read_six_numbers> // read in six numbers from the user
400f15: 48 89 e3 mov %rsp,%rbx // move rsp to rbx
400f18: 4c 8d 64 24 0c lea 0xc(%rsp),%r12 // ?
400f1d: bd 00 00 00 00 mov $0x0,%ebp // set ebp to 0?
400f22: 49 89 dd mov %rbx,%r13 // move rbx to r13
400f25: 8b 43 0c mov 0xc(%rbx),%eax // ?
400f28: 39 03 cmp %eax,(%rbx) // compare eax and rbx
400f2a: 74 05 je 400f31 <phase_2+0x2e> // if equal, skip explode
400f2c: e8 1c 07 00 00 callq 40164d <explode_bomb> // bomb detonates (fail)
400f31: 41 03 6d 00 add 0x0(%r13),%ebp // add r13 and ebp (?)
400f35: 48 83 c3 04 add $0x4,%rbx // add 4 to rbx
400f39: 4c 39 e3 cmp %r12,%rbx // compare r12 and rbx
400f3c: 75 e4 jne 400f22 <phase_2+0x1f> // loop? if not equal, jump to 400f22
400f3e: 85 ed test %ebp,%ebp // compare ebp with itself?
400f40: 75 05 jne 400f47 <phase_2+0x44> // skip explosion if not equal
400f42: e8 06 07 00 00 callq 40164d <explode_bomb> // bomb detonates (fail)
400f47: 48 83 c4 28 add $0x28,%rsp
400f4b: 5b pop %rbx
400f4c: 5d pop %rbp
400f4d: 41 5c pop %r12
400f4f: 41 5d pop %r13
400f51: c3 retq
Any help is greatly appreciated - especially advice on how I would go about translating something like this into C code. Thanks in advance!

especially advice on how I would go about translating something like this into C code
Don't literally translate it into C.
Learn to think in terms of how algorithms are implemented in terms of changes to registers and memory. C and asm are just different ways of expressing what you actually want the machine to do.
Every instruction makes a well-defined modification to the architectural state of the machine, so just follow that chain of steps and see the result. Any good debugger (e.g. gdb in layout reg mode) can show you which register was modified as you single-step. The insn ref manual (links in the x86 tag wiki) has full documentation on exactly what every instruction does.
If you're ever surprised by something, look it up. There are many SO questions from people that didn't do that, and then posted silly questions about div results when they didn't set rdx first.
You need to find connections between insns that modify or overwrite a register or memory location, and later instructions that read from that register or memory location.
You can often get clues from how a register is being used, e.g. add $0x4,%rbx is probably a pointer increment to an int *. It's rare to increment a 64bit integer by 4 if it isn't a pointer, or involved in memory addressing somehow.
If you look at surrounding code and find mov 0xc(%rbx),%eax (loading 4B from an offset from %rbx), that confirms the theory that it's a pointer.
The cmp %r12,%rbx / jcc tells you that it's also part of the loop condition, and that %r12 is the end pointer. You check it's just a simple do{}while(p < end) loop by verifying that %r12 isn't modified in the loop, and that it's initialized to something sensible before the loop.
mov $0x0,%ebp tells you that this is compiler output from -O0 or -O1, because every x86 compiler knows the "peephole" optimization that xor %ebp,%ebp is the best way to zero registers. Fortunately this looks like -O1 compiler output, so it doesn't store everything to memory after every C statement and reload after. That makes code that's hard to follow, because a value doesn't stay live in the same register for long.
If you have any specific questions about that binary bomb code, you should ask them. I just answered the "how to read asm" part.

Related

Why does GCC use additional registers for pushing values onto the stack? [duplicate]

This question already has an answer here:
Why does the x86-64 System V calling convention pass args in registers instead of just the stack?
(1 answer)
Closed 8 months ago.
This C code
void test_function(int a, int b, int c, int d) {}
int main() {
test_function(1, 2, 3, 4);
return 0;
}
gets compiled by GCC (no flags, version 12.1.1, target x86_64-redhat-linux) into
0000000000401106 <test_function>:
401106: 55 push rbp
401107: 48 89 e5 mov rbp,rsp
40110a: 89 7d fc mov DWORD PTR [rbp-0x4],edi
40110d: 89 75 f8 mov DWORD PTR [rbp-0x8],esi
401110: 89 55 f4 mov DWORD PTR [rbp-0xc],edx
401113: 89 4d f0 mov DWORD PTR [rbp-0x10],ecx
401116: 90 nop
401117: 5d pop rbp
401118: c3 ret
0000000000401119 <main>:
401119: 55 push rbp
40111a: 48 89 e5 mov rbp,rsp
40111d: b9 04 00 00 00 mov ecx,0x4
401122: ba 03 00 00 00 mov edx,0x3
401127: be 02 00 00 00 mov esi,0x2
40112c: bf 01 00 00 00 mov edi,0x1
401131: e8 d0 ff ff ff call 401106 <test_function>
401136: b8 00 00 00 00 mov eax,0x0
40113b: 5d pop rbp
40113c: c3 ret
Why are additional registers (ecx, edx, esi, edi) used as intermediary storage for values 1, 2, 3, 4 instead of putting them into rbp directly?

"as intermediary storage": You confusion seems to be this part.
The ABI specifies that these function arguments are passed in the registers you are seeing (see comments under the question). The registers are not just used as intermediary. The value are never supposed to be put on the stack at all. They stay in the register the whole time, unless the function needs to reuse the register for something else or pass on a pointer to the function parameter or something similar.
What you are seeing in test_function is just an artifact of not compiling with optimizations enabled. The mov instructions putting the registers on the stack are pointless, since nothing is done with them afterwards. The stack pointer is just immediately restored and then the function returns.
The whole function should just be a single ret instruction. See https://godbolt.org/z/qG9GjMohY where -O2 is used.
Without optimizations enabled the compiler makes no attempt to remove instructions even if they are pointless and it always stores values of variables to memory and loads them from memory again, even if they could have been held in registers. That's why it is almost always pointless to look at -O0 assembly.

The registers are used for the arguments to call the function. The standard calling convertion calls for aguments to be placed in certain register, so the code you see in main puts the arguments into those registers and the code in test_function expects them in those registers and reads them from there.
So your follow-on question might be "why is test_function copying those argument on to the stack?". That's because you're compiling without optimization, so the compiler produces inefficient code, allocation space in the stack frame for every argument and local var and copying the arguments from their input register into the stack frame as part of the function prolog. If you were to use those values in th function, you would see it reading them from the stack frame locations even though they are probably still in the registers. If you compile with -O, you'll see the compiler get rid of all this, as the stack frame is not needed.

Why for accessing elements of char array byte transffer is used

Let's consider this very simple code
int main(void)
{
char buff[500];
int i;
for (i=0; i<500; i++)
{
(buff[i])++;
}
}
So, it just goes through 500 bytes and increments it. This code was compiled using gcc on x86-64 architecture and disassembled using objdump -D utility. Looking at the disassembled code, I found out that data are transferred from memory to register byte by byte (see, movzbl instruction is used to get data from memory and mov %dl is used to store data in memory)
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 48 81 ec 88 01 00 00 sub $0x188,%rsp
4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4004ff: eb 20 jmp 400521 <main+0x34>
400501: 8b 45 fc mov -0x4(%rbp),%eax
400504: 48 98 cltq
400506: 0f b6 84 05 00 fe ff movzbl -0x200(%rbp,%rax,1),%eax
40050d: ff
40050e: 8d 50 01 lea 0x1(%rax),%edx
400511: 8b 45 fc mov -0x4(%rbp),%eax
400514: 48 98 cltq
400516: 88 94 05 00 fe ff ff mov %dl,-0x200(%rbp,%rax,1)
40051d: 83 45 fc 01 addl $0x1,-0x4(%rbp)
400521: 81 7d fc f3 01 00 00 cmpl $0x1f3,-0x4(%rbp)
400528: 7e d7 jle 400501 <main+0x14>
40052a: c9 leaveq
40052b: c3 retq
40052c: 0f 1f 40 00 nopl 0x0(%rax)
Looks like it has some performance implications, because in that case you have to access memory 500 times to read and 500 times to store. I know that cache system will cope it somehow, but anyway.
My question is why we can't load the quadwords and just do a couple of bit operations to increase each byte of that quadword and then push it back to memory? Obviously it would require some addition logic to deal with the last part of data that is less than quadword and some additional register.But this approach would dramatically reduce number of memory accessing that is the most expensive operation. Probably I don't see some obstacles that inhibit such optimization. So, it would be great to get some explanations here.

Reason why this shouldn't be done: Imagine if char happened to be unsigned (to make overflow have defined behavior) and you had a byte 0xFF followed (or preceded, depending on endianness) by 0x1.
Incrementing a byte at a time, you'd end up with the 0xFF becoming 0x00 and the 0x01 becoming 0x02. But if you just loaded 4 or 8 bytes at a time and added 0x01010101 (or eight byte equivalent) to achieve the same result, the 0xFF would overflow into the 0x01, so you'd end up with 0x00 and 0x03, not 0x00 and 0x02.
Similar issues would typically occur with signed char too; signed overflow and truncation rules (or lack thereof) make it more complicated, but the gist is that incrementing a byte at a time limits effects to that byte, with no cross-byte "interference".

When you compile without optimization, the compiler does a more literal translation of code to assembly, part of the reason for this is so that when you step through the code in a debugger, the steps correspond to your code.
If you enable optimization then the assembly may look completely different.
Also, your program causes undefined behaviour by reading an uninitialized char.

Why assembly code is different for simple C program with different gcc version?

I'm understanding the basics of assembly and c programming.
I compiled following simple program in C,
#include <stdio.h>
int main()
{
int a;
int b;
a = 10;
b = 88
return 0;
}
Compiled with following command,
gcc -ggdb -fno-stack-protector test.c -o test
The disassembled code for above program with gcc version 4.4.7 is:
5 push %ebp
89 e5 mov %esp,%ebp
83 ec 10 sub $0x10,%esp
c7 45 f8 0a 00 00 00 movl $0xa,-0x8(%ebp)
c7 45 fc 58 00 00 00 movl $0x58,-0x4(%ebp)
b8 00 00 00 00 mov $0x0,%eax
c9 leave
c3 ret
90 nop
However disassembled code for same program with gcc version 4.3.3 is:
8d 4c 23 04 lea 0x4(%esp), %ecx
83 e4 f0 and $0xfffffff0, %esp
55 push -0x4(%ecx)
89 e5 mov %esp,%ebp
51 push %ecx
83 ec 10 sub $0x10,%esp
c7 45 f4 0a 00 00 00 00 movl $0xa, -0xc(%ebp)
c7 45 f8 58 00 00 00 00 movl $0x58, -0x8(%ebp)
b8 00 00 00 00 mov $0x0, %eax
83 c4 10 add $0x10,%esp
59 pop %ecx
5d pop %ebp
8d 61 fc lea -0x4(%ecx),%esp
c3 ret
Why there is difference in the assembly code?
As you can see in second assembled code, Why pushing %ecx on stack?
What is significance of and $0xfffffff0, %esp?
note: OS is same

Compilers are not required to produce identical assembly code for the same source code. The C standard allows the compiler to optimize the code as they see fit as long as the observable behaviour is the same. So, different compilers may generate different assembly code.
For your code, GCC 6.2 with -O3 generates just:
xor eax, eax
ret
because your code essentially does nothing. So, it's reduced to a simple return statement.

To give you some idea, how many ways exists to create valid code for particular task, I thought this example may help.
From time to time there are size coding competitions, obviously targetting Assembly programmers, as you can't compete with compiler against hand written assembly at this level at all.
The competition tasks are fairly trivial to make the entry level and total effort reasonable, with precise input and output specifications (down to single byte or pixel perfection).
So you have almost trivial exact task, human produced code (at the moment still outperforming compilers for trivial task), with single simple rule "minimal size" as a goal.
With your logic it's absolutely clear every competitor should produce the same result.
The real world answer to this is for example:
Hugi Size Coding Competition Series - Compo29 - Random Maze Builder
12 entries, size of code (in bytes): 122, 122, 128, 135, 136, 137, 147, ... 278 (!).
And I bet the first two entries, both having 122B are probably different enough (too lazy to actually check them).
Now producing valid machine code from high level programming language and by machine (compiler) is lot more complex task. And compilers can't compete with humans in reasoning, most of the "how good code is produced by c++ compiler" stems from C++ language itself being defined quite close to machine code (easy to compile) and from brute CPU power allowing the compilers to work on thousands of variants for particular code path, searching for near-optimal solution mostly by brute force.
Still the numerical "reasoning" behind the optimizers are state of art in their own way, getting to the point where human are still unreachable, but more like in their own way, just as humans can't achieve the efficiency of compilers within reasonable effort for full-sized app compilation.
At this point reasoning about some debug code being different in few helper prologue/epilogue instructions... Even if you would find difference in optimized code, and the difference being "obvious" to human, it's still quite a feat the compiler can produce at least that, as compiler has to apply universal rules on specific code, without truly understanding the context of task.

Why does initializing a variable `i` to 0 and to a large size result in the same size of the program?

There is a problem which confuses me a lot.
int main(int argc, char *argv[])
{
int i = 12345678;
return 0;
}
int main(int argc, char *argv[])
{
int i = 0;
return 0;
}
The programs have the same bytes in total. Why?
And where the literal value indeed stored? Text segment or other place?

The programs have the same bytes in total.Why?
There are two possibilities:
The compiler is optimizing out the variable. It isn't used anywhere and therefore doesn't make sense.
If 1. doesn't apply, the program sizes are equal anyway. Why shouldn't they? 0 is just as large in size as 12345678. Two variables of type T occupy the same size in memory.
And where the literal value indeed stored?
On the stack. Local variables are commonly stored on the stack.

Consider your bedroom.if you filled it with stuff or you left it empty,does that change the area of your bedroom?
the size of int is sizeof(int).it does not matter what value you store in it.

Because your program is optimized. At compile time, the compiler found out that i was useless and removed it.
If optimization didn't occurs, another simple explanation is that an int is the same size of another int.

TL;DR
First question: They're the same size because the instructions output of your program are roughly the same (more on that below). Further, they're the same size because the size(number of bytes) of your ints never change.
Second question: i variable is stored in your local variables frame which is in the function stack. The actual value you set to i is in the instructions (hardcoded) in the text-segment.
GDB and Assembly
I know you're using Windows, but consider these codes and output on Linux. I used the exactly same sources you posted.
For the first one, with i = 12345678, the actual main function is these computer instructions:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004004f4 <+7>: mov %rsi,-0x20(%rbp)
0x00000000004004f8 <+11>:movl $0xbc614e,-0x4(%rbp)
0x00000000004004ff <+18>:mov $0x0,%eax
0x0000000000400504 <+23>:pop %rbp
0x0000000000400505 <+24>:retq
End of assembler dump.
As for the other program, with i = 0, main is:
(gdb) disass main
Dump of assembler code for function main:
0x00000000004004ed <+0>: push %rbp
0x00000000004004ee <+1>: mov %rsp,%rbp
0x00000000004004f1 <+4>: mov %edi,-0x14(%rbp)
0x00000000004004f4 <+7>: mov %rsi,-0x20(%rbp)
0x00000000004004f8 <+11>:movl $0x0,-0x4(%rbp)
0x00000000004004ff <+18>:mov $0x0,%eax
0x0000000000400504 <+23>:pop %rbp
0x0000000000400505 <+24>:retq
End of assembler dump.
The only difference between both codes is the actual value being stored in your variable. Lets go in a step by step trough these lines bellow (my computer is x86_64, so if your architecture is different, instructions may differ).
OPCODES
And the actual instructions of main (using objdump):
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 89 7d ec mov %edi,-0x14(%rbp)
4004f4: 48 89 75 e0 mov %rsi,-0x20(%rbp)
4004f8: c7 45 fc 4e 61 bc 00 movl $0xbc614e,-0x4(%rbp)
4004ff: b8 00 00 00 00 mov $0x0,%eax
400504: 5d pop %rbp
400505: c3 retq
400506: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40050d: 00 00 00
To get the actual difference of bytes, using objdump -D prog1 > prog1_dump and objdump -D prog2 > prog2_dump and them diff prog1_dump prog2_dump:
2c2
< draft1: file format elf64-x86-64
---
> draft2: file format elf64-x86-64
51,58c51,58
< 400283: 00 bc f6 06 64 9f ba add %bh,-0x45609bfa(%rsi,%rsi,8)
< 40028a: 01 3b add %edi,(%rbx)
< 40028c: 14 d1 adc $0xd1,%al
< 40028e: 12 cf adc %bh,%cl
< 400290: cd 2e int $0x2e
< 400292: 11 77 5d adc %esi,0x5d(%rdi)
< 400295: 79 fe jns 400295 <_init-0x113>
< 400297: 3b .byte 0x3b
---
> 400283: 00 e8 add %ch,%al
> 400285: f1 icebp
> 400286: 6e outsb %ds:(%rsi),(%dx)
> 400287: 8a f8 mov %al,%bh
> 400289: a8 05 test $0x5,%al
> 40028b: ab stos %eax,%es:(%rdi)
> 40028c: 48 2d 3f e9 e2 b2 sub $0xffffffffb2e2e93f,%rax
> 400292: f7 06 53 df ba af testl $0xafbadf53,(%rsi)
287c287
< 4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
---
> 4004f8: c7 45 fc 4e 61 bc 00 movl $0xbc614e,-0x4(%rbp)
Note on address 0x4004f8 your number is there, 4e 61 bc 00 on prog2 and 00 00 00 00 on prog1, both 4 bytes which is equal to sizeof(int). The bytes c7 45 fc are the rest of the instructions (move some value into an offset of rbp). Also note that the first two sections that differ have the same size in bytes (21). So, there you go, although slightly different, they're the same size.
Step by step through Assembly Instructions
push %rbp; mov %rsp, %rbp: This is called setting up the Stack Frame, and is standard for all C functions (unless you tell gcc -fomit-frame-pointer). This enables you to access the stack and your local variables through a fixed register, in this case, rbp.
mov %edi, -0x14(%rbp): This moves the content of register edi into our local variables frame. Specifically, into offset -0x14
mov %rsi, -0x20(%rbp): Same here. But this time it saves rsi. This is part of the x86_64 calling convention (which now uses registers instead of pushing everything on stack like x86_32), but instead of keeping them in registers, we free the registers by saving the contents in our local variables frame - register are way faster and are the only way the CPU can actually process anything, so the more free registers we have, the better.
Note: edi is the 4-bytes part of the rsi register and from the x86_64 calling convention, we know that rsi register is used for the first argument. main's first argument is int argc, so it makes sense we use a 4-byte register to store it. rsi is the second argument, effectively the address of a pointer to pointer to chars (**argv). So, in 64bit architectures, that fits perfectly into a 64bit register.
<+11>: movl $0xbc614e,-0x4(%rbp): This is the actual line int i = 12345678 (0xbc614e = 12345678d). Now, note that the way we "move" that value is very similar to how we stored the main arguments. We use offset -0x4(%rbp) to store it memory, on the "local variables frame" (this answers your question on where it gets stored).
mov $0x0, %eax; pop %rbp; retq: Again, dull stuff to clear up the frame pointer and return (end the program since we're in main).
Note that on the second example, the only difference is the line <+11>: movl $0x0,-0x4(%rbp), which effectively stores the value zero - in C words, int i = 0.
So, by these instructions you can see that the main function of both programs gets translated to assembly in the exact the same way, so their sizes are the same in the end. (Assuming you compiled them the same way, because the compiler also puts lots of other things in the binaries, like data, library functions, etc. In linux, you can get a full disassembly dump using objdump -D program.
Note 2: In these examples, you cannot see how the computer subtracts values from rsp in order to allocate stack space, but that's how it's normally done.
Stack Representation
The stack would be like this for both cases (only the value of i would change, or the value at -0x4(%rbp))
| ~~~ | Higher Memory addresses
| |
+------------------+ <--- Address 0x8(%rbp)
| RETURN ADDRESS |
+------------------+ <--- Address 0x0(%rbp) // instruction push %rbp
| previous rbp |
+------------------+ <--- Address -0x4(%rbp)
| i=0x11223344 |
+------------------+ <---- Address -0x14(%rbp)
| argc |
+------------------+ <---- address -0x20(%rbp)
| argv |
+------------------+
| |
+~~~~~~~~~~~~~~~~~~+ Lower memory addresses
Note 3: The direction to where the stack grows depends on your architecture. How data gets written in memory also depends on your architecture.
Resources
What are the calling conventions for UNIX & Linux system calls on x86-64
Call Stack
GCC Optimization Options
Understanding the Stack
How does the stack work in assembly language?
x86_64 : is stack frame pointer almost useless?

Assembly - why is %rsp decremented by so much, and why are arguments stored at the top of the stack?

Assembly newbie here... I wrote the following simple C program:
void fun(int x, int* y)
{
char arr[4];
int* sp;
sp = y;
}
int main()
{
int i = 4;
fun(i, &i);
return 0;
}
I compiled it with gcc and ran objdump with -S, but the Assembly code output is confusing me:
000000000040055d <fun>:
void fun(int x, int* y)
{
40055d: 55 push %rbp
40055e: 48 89 e5 mov %rsp,%rbp
400561: 48 83 ec 30 sub $0x30,%rsp
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
40056c: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
400573: 00 00
400575: 48 89 45 f8 mov %rax,-0x8(%rbp)
400579: 31 c0 xor %eax,%eax
char arr[4];
int* sp;
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
}
400583: 48 8b 45 f8 mov -0x8(%rbp),%rax
400587: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
40058e: 00 00
400590: 74 05 je 400597 <fun+0x3a>
400592: e8 a9 fe ff ff callq 400440 <__stack_chk_fail#plt>
400597: c9 leaveq
400598: c3 retq
0000000000400599 <main>:
int main()
{
400599: 55 push %rbp
40059a: 48 89 e5 mov %rsp,%rbp
40059d: 48 83 ec 10 sub $0x10,%rsp
int i = 4;
4005a1: c7 45 fc 04 00 00 00 movl $0x4,-0x4(%rbp)
fun(i, &i);
4005a8: 8b 45 fc mov -0x4(%rbp),%eax
4005ab: 48 8d 55 fc lea -0x4(%rbp),%rdx
4005af: 48 89 d6 mov %rdx,%rsi
4005b2: 89 c7 mov %eax,%edi
4005b4: e8 a4 ff ff ff callq 40055d <fun>
return 0;
4005b9: b8 00 00 00 00 mov $0x0,%eax
}
4005be: c9 leaveq
4005bf: c3 retq
First, in the line:
400561: 48 83 ec 30 sub $0x30,%rsp
Why is the stack pointer decremented so much in the call to 'fun' (48 bytes)? I assume it has to do with alignment issues, but I cannot visualize why it would need so much space (I only count 12 bytes for local variables (assuming 8 byte pointers))?
Second, I thought that in x86_64, the arguments to a function are either stored in specific registers, or if there are a lot of them, just 'above' (with a downward growing stack) the base pointer, %rbp. Like in the picture at http://en.wikipedia.org/wiki/Call_stack#Structure except 'upside-down'.
But the lines:
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
suggest to me that they are being stored way down from the base of the stack (%rsi and %edi are where main put the arguments, right before calling 'fun', and 0x30 down from %rbp is exactly where the stack pointer is pointing...). And when I try to do stuff with them , like assigning their values to local variables, it grabs them from those locations near the head of the stack:
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
... what is going on here?! I would expect them to grab the arguments from either the registers they were stored in, or just above the base pointer, where I thought they are 'supposed to be', according to every basic tutorial I read. Every answer and post I found on here related to stack frame questions confirms my understanding of what stack frames "should" look like, so why is my Assembly output so darn weird?

Because that stuff is a hideously simplified version of what really goes on. It's like wondering why Newtonian mechanics doesn't model the movement of the planets down to the millimeter. Compilers need stack space for all sorts of things. For example, saving callee-saved registers.
Also, the fundamental fact is that debug-mode compilations contain all sorts of debugging and checking machinery. The compiler outputs all sorts of code that checks that your code is correct, for example the call to __stack_chk_fail.
There are only two ways to understand the output of a given compiler. The first is to implement the compiler, or be otherwise very familiar with the implementation. The second is to accept that whatever you understand is a gross simplification. Pick one.

Because you're compiling without optimization, the compiler does lots of extra stuff to maybe make things easier to debug, which use lots of extra space.
it does not attempt to compress the stack frame to reuse memory for anything, or get rid of any unused things.
it redundantly copies the arguments into the stack frame (which requires still more memory)
it copies a 'canary' on to the stack to guard against stack smashing buffer overflows (even though they can't happen in this code).
Try turning on optimization, and you'll see more real code.

This is 64 bit code. 0x30 of stack space corresponds to 6 slots on the stack. You have what appears to be:
2 slots for function arguments (which happen also to be passed in registers)
2 slots for local variables
1 slot for saving the AX register
1 slot looks like a stack guard, probably related to DEBUG mode.
Best thing is to experiment rather than ask questions. Try compiling in different modes (DEBUG, optimisation, etc), and with different numbers and types of arguments and variables. Sometimes asking other people is just too easy -- you learn better by doing your own experiments.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight