Why is byte-by-byte transfer used for accessing elements of a char array - c

Let's consider this very simple code
int main(void)
{
char buff[500];
int i;
for (i=0; i<500; i++)
{
(buff[i])++;
}
}
So, it just goes through 500 bytes and increments each of them. This code was compiled using gcc on the x86-64 architecture and disassembled using the objdump -D utility. Looking at the disassembled code, I found that the data is transferred between memory and registers byte by byte (the movzbl instruction loads a byte from memory and mov %dl stores a byte back to memory):
00000000004004ed <main>:
4004ed: 55 push %rbp
4004ee: 48 89 e5 mov %rsp,%rbp
4004f1: 48 81 ec 88 01 00 00 sub $0x188,%rsp
4004f8: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
4004ff: eb 20 jmp 400521 <main+0x34>
400501: 8b 45 fc mov -0x4(%rbp),%eax
400504: 48 98 cltq
400506: 0f b6 84 05 00 fe ff movzbl -0x200(%rbp,%rax,1),%eax
40050d: ff
40050e: 8d 50 01 lea 0x1(%rax),%edx
400511: 8b 45 fc mov -0x4(%rbp),%eax
400514: 48 98 cltq
400516: 88 94 05 00 fe ff ff mov %dl,-0x200(%rbp,%rax,1)
40051d: 83 45 fc 01 addl $0x1,-0x4(%rbp)
400521: 81 7d fc f3 01 00 00 cmpl $0x1f3,-0x4(%rbp)
400528: 7e d7 jle 400501 <main+0x14>
40052a: c9 leaveq
40052b: c3 retq
40052c: 0f 1f 40 00 nopl 0x0(%rax)
This looks like it has performance implications, because you have to access memory 500 times to read and 500 times to store. I know that the cache will absorb much of this, but still.
My question is: why can't we load quadwords, do a couple of bit operations to increment each byte of the quadword, and then write it back to memory? Obviously it would require some additional logic to deal with the tail of the data that is shorter than a quadword, and an additional register. But this approach would dramatically reduce the number of memory accesses, which are the most expensive operations. I am probably missing some obstacle that inhibits such an optimization, so it would be great to get an explanation.

Reason why this shouldn't be done: Imagine if char happened to be unsigned (to make overflow have defined behavior) and you had a byte 0xFF followed (or preceded, depending on endianness) by 0x1.
Incrementing a byte at a time, you'd end up with the 0xFF becoming 0x00 and the 0x01 becoming 0x02. But if you just loaded 4 or 8 bytes at a time and added 0x01010101 (or eight byte equivalent) to achieve the same result, the 0xFF would overflow into the 0x01, so you'd end up with 0x00 and 0x03, not 0x00 and 0x02.
Similar issues would typically occur with signed char too; signed overflow and truncation rules (or lack thereof) make it more complicated, but the gist is that incrementing a byte at a time limits effects to that byte, with no cross-byte "interference".
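The carry problem described above can be reproduced in a few lines. This is a minimal sketch, not what any compiler actually emits: the function names add_one_naive and add_one_per_byte are made up for illustration, and the word is treated as a plain 64-bit value whose least significant byte plays the role of the "first" byte, so the result does not depend on memory endianness.

```c
#include <stdint.h>

#define ONES  0x0101010101010101ULL  /* 0x01 in every byte */
#define HIGHS 0x8080808080808080ULL  /* top bit of every byte */

/* Naive approach: one 64-bit add. A byte holding 0xFF carries into
   the next more significant byte and corrupts it. */
uint64_t add_one_naive(uint64_t w)
{
    return w + ONES;
}

/* Carry-safe approach: add only on the low 7 bits of each byte, where
   the sum is at most 0x80 and cannot leave the byte, then restore the
   original top bit of each byte with XOR. */
uint64_t add_one_per_byte(uint64_t w)
{
    return ((w & ~HIGHS) + ONES) ^ (w & HIGHS);
}
```

With the input 0x01FF (a 0xFF byte followed by a 0x01 byte), the naive add turns the second byte into 0x03, while the masked version gives 0x02, matching what a byte-at-a-time loop produces. So the word-wide trick is possible, but it needs extra masking work per word.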

When you compile without optimization, the compiler does a more literal translation of the code to assembly. Part of the reason is so that when you step through the code in a debugger, the steps correspond to your source lines.
If you enable optimization then the assembly may look completely different.
Also, your program causes undefined behaviour by reading an uninitialized char.
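A variant of the loop that avoids both problems might look like the sketch below. It assumes the caller passes an initialized buffer, and it uses unsigned char so that wrapping from 0xFF to 0x00 is well defined. The function name increment_bytes is hypothetical.

```c
#include <stddef.h>

/* Increment every byte of an initialized buffer. The increment wraps
   from 0xFF to 0x00, which is well defined for unsigned char. */
void increment_bytes(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i]++;
}
```

Compiling this with optimization enabled (for example gcc -O2 or -O3) and disassembling it again is a quick way to check what the compiler does with the loop; depending on compiler version and flags it may process several bytes per instruction using SIMD byte adds, which do not carry between bytes.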

Related

Type casting of Macro to optimize the code

I am working on optimizing some code. Is it a good idea to cast a macro to char to reduce memory consumption? What could be the side effects of doing this?
Example:
#define TRUE 1 //non-optimized code
sizeof(TRUE) --> 4
#define TRUE ((char) 0x01) //To optimize
sizeof(TRUE) --> 1
#define MAX 10 //non-optimized code
sizeof(MAX) --> 4
#define MAX ((char) 10) //To optimize
sizeof(MAX) --> 1
They will make virtually no difference in memory consumption.
These macros provide values to be used in expressions, while the actual memory usage is (roughly) dictated by the type and number of variables and dynamically allocated memory. So, you may have TRUE as an int or as a char, but what actually matters is the type of variable it (or, the expression in which it appears) gets assigned to, which is not influenced by the type of the constant.
The only influence the type of these constants may have is on how the expressions they are used in are evaluated, but even that effect should be almost nonexistent, given that the C standard (simplifying) implicitly promotes all the smaller types to int or unsigned before carrying out almost any operation. [1]
So: if you want to reduce your memory consumption, don't look at your constants but at your data structures, especially global and dynamically allocated ones [2]! Maybe you have a huge array of double values where the precision of float would be enough, maybe you are keeping big data around longer than you need it, or you have memory leaks, or a big array of a badly laid out struct, or of booleans that are 4 bytes wide when they could be a bitfield. This is the kind of thing you should look for, definitely not these #defines.
Notes
[1] The idea being that integral operations are carried out at the native register size, which traditionally corresponds to int. Besides, even if this rule didn't hold, the only memory effect of changing the size of integral temporaries in expressions would be at most a slight increase in stack usage (which is generally mostly preallocated anyway) in case of heavy register spilling.
[2] What is allocated on the stack generally isn't problematic: as said above, it's generally preallocated, and if you were exhausting it your program would already be crashing.
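To illustrate the point about booleans, here is a small sketch that packs eight flags into one byte instead of storing each in an int. All names (flags_wide, flags_packed, flag_set, flag_clear, flag_test) are made up for this example, and bit is assumed to be in the range 0 to 7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Eight separate int flags: 8 * sizeof(int) bytes per record. */
struct flags_wide {
    int flag[8];
};

/* The same eight flags packed into a single byte. */
typedef uint8_t flags_packed;

/* Return f with the given bit (0..7) set. */
flags_packed flag_set(flags_packed f, unsigned bit)
{
    return (flags_packed)(f | (1u << bit));
}

/* Return f with the given bit (0..7) cleared. */
flags_packed flag_clear(flags_packed f, unsigned bit)
{
    return (flags_packed)(f & ~(1u << bit));
}

/* Report whether the given bit (0..7) is set. */
bool flag_test(flags_packed f, unsigned bit)
{
    return ((f >> bit) & 1u) != 0;
}
```

In a large array of records, this kind of change to the stored type is what actually moves the memory footprint; the type of the TRUE constant used to set a flag does not.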
There is no such thing as a char constant in C, which is why there are no suffixes for "short" and "char" as there are for "long" and "long long". The cast value (char)0x10 will immediately be promoted back to an int in almost any integer context because of the integer promotions (§6.3.1.1p2).
So if c is a char and you write if (c == (char)0x10) ..., both c and (char)0x10 are promoted to int before being compared.
Of course, a given compiler might elide the conversion if it knows that it makes no difference, but that compiler would certainly also use a byte constant if possible even without the explicit cast.
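The promotion behaviour can be checked at compile time. The sketch below assumes a C11 compiler (for _Static_assert); the macro names TRUE_INT and TRUE_CHAR and the function storage_for_flag are invented for the example.

```c
#include <stddef.h>

#define TRUE_INT  1
#define TRUE_CHAR ((char)1)

/* The cast expression on its own has type char... */
_Static_assert(sizeof(TRUE_CHAR) == 1, "a char is one byte");

/* ...but as soon as it takes part in arithmetic it is promoted to int. */
_Static_assert(sizeof(TRUE_CHAR + TRUE_CHAR) == sizeof(int),
               "operands are promoted to int");
_Static_assert(sizeof(+TRUE_CHAR) == sizeof(int),
               "unary plus also promotes");

/* Storage is decided by the variable's type, not by the constant's:
   an int constant assigned to a char variable still occupies one byte. */
size_t storage_for_flag(void)
{
    char flag = TRUE_INT;
    return sizeof flag;
}
```

If the file compiles, the assertions hold on that implementation, which shows that the cast in the macro changes neither the arithmetic nor the storage of the variables it is assigned to.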
How much this optimizes depends on (1) where those defines are used and (2) the architecture of the processor (or microcontroller) that runs the code.
Point (1) has already been addressed in other answers.
Point (2) is important because there are processors/microcontrollers that perform better with 8-bit values than with 32-bit ones. There are also processors that are, for example, 16-bit, where using 8-bit variables could decrease the memory needed but increase the run time of the program.
Below is an example and its disassembly:
#include <stdint.h>
#define _VAR_UINT8 ((uint8_t) -1)
#define _VAR_UINT16 ((uint16_t) -1)
#define _VAR_UINT32 ((uint32_t) -1)
#define _VAR_UINT64 ((uint64_t) -1)
volatile uint8_t v1b;
volatile uint16_t v2b;
volatile uint32_t v4b;
volatile uint64_t v8b;
int main(void) {
v1b = _VAR_UINT8;
v2b = _VAR_UINT8;
v2b = _VAR_UINT16;
v4b = _VAR_UINT8;
v4b = _VAR_UINT16;
v4b = _VAR_UINT32;
v8b = _VAR_UINT8;
v8b = _VAR_UINT16;
v8b = _VAR_UINT32;
v8b = _VAR_UINT64;
return 0;
}
Below is the disassembly for one specific x86-64 platform (it could be different if you compile the above code and generate the disassembly for your processor):
00000000004004ec <main>:
4004ec: 55 push %rbp
4004ed: 48 89 e5 mov %rsp,%rbp
4004f0: c6 05 49 0b 20 00 ff movb $0xff,0x200b49(%rip) # 601040 <v1b>
4004f7: 66 c7 05 48 0b 20 00 movw $0xff,0x200b48(%rip) # 601048 <v2b>
4004fe: ff 00
400500: 66 c7 05 3f 0b 20 00 movw $0xffff,0x200b3f(%rip) # 601048 <v2b>
400507: ff ff
400509: c7 05 31 0b 20 00 ff movl $0xff,0x200b31(%rip) # 601044 <v4b>
400510: 00 00 00
400513: c7 05 27 0b 20 00 ff movl $0xffff,0x200b27(%rip) # 601044 <v4b>
40051a: ff 00 00
40051d: c7 05 1d 0b 20 00 ff movl $0xffffffff,0x200b1d(%rip) # 601044 <v4b>
400524: ff ff ff
400527: 48 c7 05 06 0b 20 00 movq $0xff,0x200b06(%rip) # 601038 <v8b>
40052e: ff 00 00 00
400532: 48 c7 05 fb 0a 20 00 movq $0xffff,0x200afb(%rip) # 601038 <v8b>
400539: ff ff 00 00
40053d: c7 05 f1 0a 20 00 ff movl $0xffffffff,0x200af1(%rip) # 601038 <v8b>
400544: ff ff ff
400547: c7 05 eb 0a 20 00 00 movl $0x0,0x200aeb(%rip) # 60103c <v8b+0x4>
40054e: 00 00 00
400551: 48 c7 05 dc 0a 20 00 movq $0xffffffffffffffff,0x200adc(%rip) # 601038 <v8b>
400558: ff ff ff ff
40055c: b8 00 00 00 00 mov $0x0,%eax
400561: 5d pop %rbp
400562: c3 retq
400563: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40056a: 00 00 00
40056d: 0f 1f 00 nopl (%rax)
On my specific platform it uses 4 kinds of mov instruction, movb (7 bytes), movw (9 bytes), movl (10 bytes) and movq (11 bytes), depending on the variable type and also on the type of the value being assigned.
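One way to express the size versus speed trade-off portably is through the least-width and fast-width types in <stdint.h>. This is only a sketch of the idea, with a made-up function max_sample: uint_least8_t asks for the smallest type with at least 8 bits (for bulk storage), while uint_fast8_t lets the implementation pick whatever width of at least 8 bits it considers fastest (for working values).

```c
#include <stddef.h>
#include <stdint.h>

/* Samples are stored compactly, but the running maximum is kept in a
   type the implementation considers fast for this target. */
uint_fast8_t max_sample(const uint_least8_t *samples, size_t n)
{
    uint_fast8_t best = 0;

    for (size_t i = 0; i < n; i++) {
        if (samples[i] > best)
            best = samples[i];
    }
    return best;
}
```

Whether the two types actually differ in width is up to the platform's C library, so the effect has to be checked in the generated code for the target in question.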

Reading x86 assembly code

I am working through a lab where I have to defuse a "bomb" by providing the correct input for each phase. I do not have access to the source code, so I have to step through the assembly code with GDB. Right now, I'm stuck on phase 2 and would really appreciate some help. Here is the x86 assembly code - I've added some comments that describe what I think is happening, but these could be horribly wrong because we only started learning assembly code a few days ago and I'm still quite shaky. As far as I can tell right now, this phase reads in 6 numbers from the user (that's what read_six_numbers does) and seems to go through some type of loop.
0000000000400f03 <phase_2>:
400f03: 41 55 push %r13 // save values
400f05: 41 54 push %r12
400f07: 55 push %rbp
400f08: 53 push %rbx
400f09: 48 83 ec 28 sub $0x28,%rsp // decrease stack pointer
400f0d: 48 89 e6 mov %rsp,%rsi // move rsp to rsi
400f10: e8 5a 07 00 00 callq 40166f <read_six_numbers> // read in six numbers from the user
400f15: 48 89 e3 mov %rsp,%rbx // move rsp to rbx
400f18: 4c 8d 64 24 0c lea 0xc(%rsp),%r12 // ?
400f1d: bd 00 00 00 00 mov $0x0,%ebp // set ebp to 0?
400f22: 49 89 dd mov %rbx,%r13 // move rbx to r13
400f25: 8b 43 0c mov 0xc(%rbx),%eax // ?
400f28: 39 03 cmp %eax,(%rbx) // compare eax and rbx
400f2a: 74 05 je 400f31 <phase_2+0x2e> // if equal, skip explode
400f2c: e8 1c 07 00 00 callq 40164d <explode_bomb> // bomb detonates (fail)
400f31: 41 03 6d 00 add 0x0(%r13),%ebp // add r13 and ebp (?)
400f35: 48 83 c3 04 add $0x4,%rbx // add 4 to rbx
400f39: 4c 39 e3 cmp %r12,%rbx // compare r12 and rbx
400f3c: 75 e4 jne 400f22 <phase_2+0x1f> // loop? if not equal, jump to 400f22
400f3e: 85 ed test %ebp,%ebp // compare ebp with itself?
400f40: 75 05 jne 400f47 <phase_2+0x44> // skip explosion if not equal
400f42: e8 06 07 00 00 callq 40164d <explode_bomb> // bomb detonates (fail)
400f47: 48 83 c4 28 add $0x28,%rsp
400f4b: 5b pop %rbx
400f4c: 5d pop %rbp
400f4d: 41 5c pop %r12
400f4f: 41 5d pop %r13
400f51: c3 retq
Any help is greatly appreciated - especially advice on how I would go about translating something like this into C code. Thanks in advance!
"especially advice on how I would go about translating something like this into C code"
Don't literally translate it into C.
Learn to think in terms of how algorithms are implemented in terms of changes to registers and memory. C and asm are just different ways of expressing what you actually want the machine to do.
Every instruction makes a well-defined modification to the architectural state of the machine, so just follow that chain of steps and see the result. Any good debugger (e.g. gdb in layout reg mode) can show you which register was modified as you single-step. The insn ref manual (links in the x86 tag wiki) has full documentation on exactly what every instruction does.
If you're ever surprised by something, look it up. There are many SO questions from people that didn't do that, and then posted silly questions about div results when they didn't set rdx first.
You need to find connections between insns that modify or overwrite a register or memory location, and later instructions that read from that register or memory location.
You can often get clues from how a register is being used, e.g. add $0x4,%rbx is probably a pointer increment to an int *. It's rare to increment a 64bit integer by 4 if it isn't a pointer, or involved in memory addressing somehow.
If you look at surrounding code and find mov 0xc(%rbx),%eax (loading 4B from an offset from %rbx), that confirms the theory that it's a pointer.
The cmp %r12,%rbx / jcc tells you that it's also part of the loop condition, and that %r12 is the end pointer. You check it's just a simple do{}while(p < end) loop by verifying that %r12 isn't modified in the loop, and that it's initialized to something sensible before the loop.
mov $0x0,%ebp tells you that this is compiler output from -O0 or -O1, because every x86 compiler knows the "peephole" optimization that xor %ebp,%ebp is the best way to zero registers. Fortunately this looks like -O1 compiler output, so it doesn't store everything to memory after every C statement and reload after. That makes code that's hard to follow, because a value doesn't stay live in the same register for long.
If you have any specific questions about that binary bomb code, you should ask them. I just answered the "how to read asm" part.
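As an illustration of following the register chain in the listing, here is one possible C rendering of the loop. It is a hedged reconstruction, not the original source: it assumes read_six_numbers stores six ints starting at %rsp, the name phase_2_passes is invented, and returning 0 stands in for the places where the original calls explode_bomb.

```c
/* Returns 1 if the six numbers would pass, 0 where the listing would
   reach a call to explode_bomb. */
int phase_2_passes(const int nums[6])
{
    const int *p = nums;         /* mov %rsp,%rbx */
    const int *end = nums + 3;   /* lea 0xc(%rsp),%r12 : 0xc bytes = 3 ints */
    unsigned sum = 0;            /* mov $0x0,%ebp (unsigned: add wraps) */

    do {
        if (p[0] != p[3])        /* mov 0xc(%rbx),%eax ; cmp %eax,(%rbx) */
            return 0;            /* je not taken: explode_bomb */
        sum += (unsigned)*p;     /* add 0x0(%r13),%ebp, with %r13 == %rbx */
        p++;                     /* add $0x4,%rbx */
    } while (p != end);          /* cmp %r12,%rbx ; jne 400f22 */

    return sum != 0;             /* test %ebp,%ebp ; jne skips explode_bomb */
}
```

Each line of C corresponds to one or two instructions, which is the kind of connection between writers and later readers of a register that the answer describes.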

Try to understand calling process in assembly code

I wrote a very simple program in C and am trying to understand the function calling process.
#include "stdio.h"
void Oh(unsigned x) {
printf("%u\n", x);
}
int main(int argc, char const *argv[])
{
Oh(0x67611c8c);
return 0;
}
And its assembly code seems to be
0000000100000f20 <_Oh>:
100000f20: 55 push %rbp
100000f21: 48 89 e5 mov %rsp,%rbp
100000f24: 48 83 ec 10 sub $0x10,%rsp
100000f28: 48 8d 05 6b 00 00 00 lea 0x6b(%rip),%rax # 100000f9a <_printf$stub+0x20>
100000f2f: 89 7d fc mov %edi,-0x4(%rbp)
100000f32: 8b 75 fc mov -0x4(%rbp),%esi
100000f35: 48 89 c7 mov %rax,%rdi
100000f38: b0 00 mov $0x0,%al
100000f3a: e8 3b 00 00 00 callq 100000f7a <_printf$stub>
100000f3f: 89 45 f8 mov %eax,-0x8(%rbp)
100000f42: 48 83 c4 10 add $0x10,%rsp
100000f46: 5d pop %rbp
100000f47: c3 retq
100000f48: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
100000f4f: 00
0000000100000f50 <_main>:
100000f50: 55 push %rbp
100000f51: 48 89 e5 mov %rsp,%rbp
100000f54: 48 83 ec 10 sub $0x10,%rsp
100000f58: b8 8c 1c 61 67 mov $0x67611c8c,%eax
100000f5d: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
100000f64: 89 7d f8 mov %edi,-0x8(%rbp)
100000f67: 48 89 75 f0 mov %rsi,-0x10(%rbp)
100000f6b: 89 c7 mov %eax,%edi
100000f6d: e8 ae ff ff ff callq 100000f20 <_Oh>
100000f72: 31 c0 xor %eax,%eax
100000f74: 48 83 c4 10 add $0x10,%rsp
100000f78: 5d pop %rbp
100000f79: c3 retq
Well, I don't quite understand the argument passing process. Since there is only one parameter passed to the Oh function, I can understand this:
100000f58: b8 8c 1c 61 67 mov $0x67611c8c,%eax
So what does the code below do? Why rbp? Isn't it abandoned in x86-64 assembly? If this is x86-style assembly, how can I generate x86-64-style assembly using clang? If it is x86, it doesn't matter; could anyone explain the code below line by line for me?
100000f5d: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%rbp)
100000f64: 89 7d f8 mov %edi,-0x8(%rbp)
100000f67: 48 89 75 f0 mov %rsi,-0x10(%rbp)
100000f6b: 89 c7 mov %eax,%edi
100000f6d: e8 ae ff ff ff callq 100000f20 <_Oh>
You might get cleaner code if you turned optimizations on, or you might not. But, here’s what that does.
The %rbp register is being used as a frame pointer, that is, a pointer to the original top of the stack. It’s saved on the stack, stored, and restored at the end. Far from being removed in x86_64, it was added there; the 32-bit equivalent was %ebp.
After this value is saved, the program allocates sixteen bytes off the stack by subtracting from the stack pointer.
There then is a very inefficient series of copies that sets the first argument of Oh() as the second argument of printf() and the constant address of the format string (relative to the instruction pointer) as the first argument of printf(). Remember that, in this calling convention, the first argument is passed in %rdi (or %edi for 32-bit operands) and the second in %rsi. This could have been simplified to two instructions.
After calling printf(), the program (needlessly) saves the return value on the stack, restores the stack and frame pointers, and returns.
In main(), there’s similar code to set up the stack frame, then the program saves argc and argv (needlessly), then it moves around the constant argument to Oh into its first argument, by way of %eax. This could have been optimized into a single instruction. It then calls Oh(). On return, it sets its return value to 0, cleans up the stack, and returns.
The code you’re asking about does the following: stores the constant 32-bit value 0 on the stack, saves the 32-bit value argc on the stack, saves the 64-bit pointer argv on the stack (the first and second arguments to main()), and sets the first argument of the function it is about to call to %eax, which it had previously loaded with a constant. This is all unnecessary for this program, but would have been necessary had it needed to use argc and argv after the call, when those registers would have been clobbered. There’s no good reason it used two steps to load the constant instead of one.
As Jester mentions, you still have frame pointers on (to aid debugging), so stepping through main:
0000000100000f50 <_main>:
First we enter a new stack frame, we have to save the base pointer and move the stack to the new base. Also, in x86_64 the stack frame has to be aligned to a 16 byte boundary (hence moving the stack pointer by 0x10).
100000f50: push %rbp
100000f51: mov %rsp,%rbp
100000f54: sub $0x10,%rsp
As you mention, x86_64 passes parameters by register, so load the param in to the register:
100000f58: mov $0x67611c8c,%eax
??? Help needed
100000f5d: movl $0x0,-0x4(%rbp)
From here: "Registers RBP, RBX, and R12-R15 are callee-save registers", so if we want to save other registers then we have to do it ourselves ....
100000f64: mov %edi,-0x8(%rbp)
100000f67: mov %rsi,-0x10(%rbp)
Not really sure why we didn't just load this in %edi where it needs to be for the call to begin with, but we better move it there now.
100000f6b: mov %eax,%edi
Call the function:
100000f6d: callq 100000f20 <_Oh>
This is the return value (passed in %eax). xor is a smaller instruction than loading 0, so it is a common optimization:
100000f72: xor %eax,%eax
Clean up that stack frame we added earlier (not really sure why we saved those registers on it when we didn't use them)
100000f74: add $0x10,%rsp
100000f78: pop %rbp
100000f79: retq

Does anyone know why gcc 4.8.4 optimizes this code into an infinite loop?

I find very strange the differences between the assembler results of the following code compiled without optimization and with -Os optimization.
#include <stdio.h>
int main(){
int i;
for(i=3;i>2;i++);
printf("%d\n",i);
return 0;
}
Without optimization the code results:
000000000040052d <main>:
40052d: 55 push %rbp
40052e: 48 89 e5 mov %rsp,%rbp
400531: 48 83 ec 10 sub $0x10,%rsp
400535: c7 45 fc 03 00 00 00 movl $0x3,-0x4(%rbp)
40053c: c7 45 fc 03 00 00 00 movl $0x3,-0x4(%rbp)
400543: eb 04 jmp 400549 <main+0x1c>
400545: 83 45 fc 01 addl $0x1,-0x4(%rbp)
400549: 83 7d fc 02 cmpl $0x2,-0x4(%rbp)
40054d: 7f f6 jg 400545 <main+0x18>
40054f: 8b 45 fc mov -0x4(%rbp),%eax
400552: 89 c6 mov %eax,%esi
400554: bf f4 05 40 00 mov $0x4005f4,%edi
400559: b8 00 00 00 00 mov $0x0,%eax
40055e: e8 ad fe ff ff callq 400410 <printf@plt>
400563: b8 00 00 00 00 mov $0x0,%eax
400568: c9 leaveq
400569: c3 retq
and the output is: -2147483648 (as I expect on a PC)
With -Os the code results:
0000000000400400 <main>:
400400: eb fe jmp 400400 <main>
I think the second result is an error!!! I think the compiler should have compiled something corresponding to the code:
printf("%d\n",-2147483648);
The compiler is working as it should.
Signed integer overflow is undefined behaviour in C. Any program that relies on it is broken.
The compiler replaces for(i=3;i>2;i++); with while(1); because it sees that i starts at 3 and only increases, so its value can never be less than 3.
Only overflow could make the loop exit. But that is undefined, and the compiler assumes that you would never do such a dirty thing.
Because the loop is infinite, printf is never reached and can be removed.
The unoptimized version worked only by accident. The compiler could have done the same thing there and it would have been equally valid.
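If wrap-around or a stop at the maximum is what the program actually needs, it has to be written without relying on signed overflow. The sketch below shows two common options; the function names checked_increment and wrapping_increment are made up for illustration.

```c
#include <limits.h>
#include <stdbool.h>

/* Signed: refuse to step past INT_MAX instead of overflowing.
   Returns false and leaves *out untouched if i + 1 is not representable. */
bool checked_increment(int i, int *out)
{
    if (i == INT_MAX)
        return false;
    *out = i + 1;
    return true;
}

/* Unsigned arithmetic is defined to wrap modulo 2^N, so UINT_MAX + 1
   is 0 and the compiler may not assume the value only grows. */
unsigned wrapping_increment(unsigned u)
{
    return u + 1u;
}
```

A loop built on either of these has a well-defined exit, so the optimizer cannot turn it into while(1).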
Well, the compiler is allowed to assume that the program will never exhibit undefined behaviour.
You get INT_MIN in the first case because the overflow of INT_MAX + 1 happens to give INT_MIN (*), but this is undefined behaviour. The C99 draft (n1256) says in 6.5 Expressions §5: "If an exceptional condition occurs during the evaluation of an expression (that is, if the result is not mathematically defined or not in the range of representable values for its type), the behavior is undefined."
So compiler can say:
loop starts with an index value greater than the limit
index is always increased
if no UB occurs, index will always be greater than the limit => this is an infinite loop
With the as-if rule (5.1.2.3 Program execution §3: "An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced"), it can replace your loop with an infinite loop. The instructions that follow can no longer be reached and can be removed.
You invoked undefined behaviour and got... undefined behaviour.
(*) and even this is plainly implementation dependent: INT_MIN could be -2147483647 if you had 1's complement, 0x80000000 could be a negative 0 in a sign-and-magnitude representation, or overflow could raise a signal...

Assembly - why is %rsp decremented by so much, and why are arguments stored at the top of the stack?

Assembly newbie here... I wrote the following simple C program:
void fun(int x, int* y)
{
char arr[4];
int* sp;
sp = y;
}
int main()
{
int i = 4;
fun(i, &i);
return 0;
}
I compiled it with gcc and ran objdump with -S, but the Assembly code output is confusing me:
000000000040055d <fun>:
void fun(int x, int* y)
{
40055d: 55 push %rbp
40055e: 48 89 e5 mov %rsp,%rbp
400561: 48 83 ec 30 sub $0x30,%rsp
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
40056c: 64 48 8b 04 25 28 00 mov %fs:0x28,%rax
400573: 00 00
400575: 48 89 45 f8 mov %rax,-0x8(%rbp)
400579: 31 c0 xor %eax,%eax
char arr[4];
int* sp;
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
}
400583: 48 8b 45 f8 mov -0x8(%rbp),%rax
400587: 64 48 33 04 25 28 00 xor %fs:0x28,%rax
40058e: 00 00
400590: 74 05 je 400597 <fun+0x3a>
400592: e8 a9 fe ff ff callq 400440 <__stack_chk_fail@plt>
400597: c9 leaveq
400598: c3 retq
0000000000400599 <main>:
int main()
{
400599: 55 push %rbp
40059a: 48 89 e5 mov %rsp,%rbp
40059d: 48 83 ec 10 sub $0x10,%rsp
int i = 4;
4005a1: c7 45 fc 04 00 00 00 movl $0x4,-0x4(%rbp)
fun(i, &i);
4005a8: 8b 45 fc mov -0x4(%rbp),%eax
4005ab: 48 8d 55 fc lea -0x4(%rbp),%rdx
4005af: 48 89 d6 mov %rdx,%rsi
4005b2: 89 c7 mov %eax,%edi
4005b4: e8 a4 ff ff ff callq 40055d <fun>
return 0;
4005b9: b8 00 00 00 00 mov $0x0,%eax
}
4005be: c9 leaveq
4005bf: c3 retq
First, in the line:
400561: 48 83 ec 30 sub $0x30,%rsp
Why is the stack pointer decremented so much in the call to 'fun' (48 bytes)? I assume it has to do with alignment issues, but I cannot visualize why it would need so much space (I only count 12 bytes for local variables (assuming 8 byte pointers))?
Second, I thought that in x86_64, the arguments to a function are either stored in specific registers, or if there are a lot of them, just 'above' (with a downward growing stack) the base pointer, %rbp. Like in the picture at http://en.wikipedia.org/wiki/Call_stack#Structure except 'upside-down'.
But the lines:
400565: 89 7d dc mov %edi,-0x24(%rbp)
400568: 48 89 75 d0 mov %rsi,-0x30(%rbp)
suggest to me that they are being stored way down from the base of the stack (%rsi and %edi are where main put the arguments, right before calling 'fun', and 0x30 down from %rbp is exactly where the stack pointer is pointing...). And when I try to do stuff with them , like assigning their values to local variables, it grabs them from those locations near the head of the stack:
sp = y;
40057b: 48 8b 45 d0 mov -0x30(%rbp),%rax
40057f: 48 89 45 e8 mov %rax,-0x18(%rbp)
... what is going on here?! I would expect them to grab the arguments from either the registers they were stored in, or just above the base pointer, where I thought they are 'supposed to be', according to every basic tutorial I read. Every answer and post I found on here related to stack frame questions confirms my understanding of what stack frames "should" look like, so why is my Assembly output so darn weird?
Because that stuff is a hideously simplified version of what really goes on. It's like wondering why Newtonian mechanics doesn't model the movement of the planets down to the millimeter. Compilers need stack space for all sorts of things. For example, saving callee-saved registers.
Also, the fundamental fact is that debug-mode compilations contain all sorts of debugging and checking machinery. The compiler outputs all sorts of code that checks that your code is correct, for example the call to __stack_chk_fail.
There are only two ways to understand the output of a given compiler. The first is to implement the compiler, or be otherwise very familiar with the implementation. The second is to accept that whatever you understand is a gross simplification. Pick one.
Because you're compiling without optimization, the compiler does lots of extra stuff to maybe make things easier to debug, which uses lots of extra space:
it does not attempt to compress the stack frame to reuse memory for anything, or get rid of any unused things.
it redundantly copies the arguments into the stack frame (which requires still more memory)
it copies a 'canary' on to the stack to guard against stack smashing buffer overflows (even though they can't happen in this code).
Try turning on optimization, and you'll see more real code.
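The canary mechanism mentioned above can be modelled in portable C. This is a conceptual sketch only: the real check is emitted by the compiler and reads its secret from %fs:0x28, whereas here reference_guard is a fixed stand-in value, the "frame" is an ordinary byte array, and the name write_and_check is invented.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ARR_SIZE = 4 };  /* models char arr[4] */

/* Stand-in for the secret value the real code loads from %fs:0x28. */
static const uintptr_t reference_guard = (uintptr_t)0x5a5a5a5aUL;

/* Copies n bytes into a modelled frame laid out as [arr][guard].
   Returns 1 if the guard survived, 0 if it was overwritten, and -1 if
   n does not fit in the modelled frame at all. */
int write_and_check(const unsigned char *src, size_t n)
{
    unsigned char frame[ARR_SIZE + sizeof(uintptr_t)];
    uintptr_t seen;

    if (n > sizeof frame)
        return -1;

    /* Prologue: place the guard just above the local array. */
    memcpy(frame + ARR_SIZE, &reference_guard, sizeof reference_guard);

    /* Function body: a copy that overruns arr when n > ARR_SIZE. */
    memcpy(frame, src, n);

    /* Epilogue: compare the stored guard with the reference value;
       the real code calls __stack_chk_fail on a mismatch. */
    memcpy(&seen, frame + ARR_SIZE, sizeof seen);
    return seen == reference_guard;
}
```

Writing 4 bytes leaves the guard intact, while writing 8 bytes spills into it and the comparison fails, which is the situation the xor %fs:0x28,%rax / je pair in the listing is there to detect.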
This is 64 bit code. 0x30 of stack space corresponds to 6 slots on the stack. You have what appears to be:
2 slots for function arguments (which happen also to be passed in registers)
2 slots for local variables
1 slot for saving the AX register
1 slot looks like a stack guard, probably related to DEBUG mode.
Best thing is to experiment rather than ask questions. Try compiling in different modes (DEBUG, optimisation, etc), and with different numbers and types of arguments and variables. Sometimes asking other people is just too easy -- you learn better by doing your own experiments.

Resources