About gcc-compiled x86_64 code and C code optimization - c

I compiled the following C code:
typedef struct {
long x, y, z;
} Foo;
long Bar(Foo *f, long i)
{
return f[i].x + f[i].y + f[i].z;
}
with the command gcc -S -O3 test.c. Here is the Bar function in the output:
.section __TEXT,__text,regular,pure_instructions
.globl _Bar
.align 4, 0x90
_Bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
leaq (%rsi,%rsi,2), %rcx
movq 8(%rdi,%rcx,8), %rax
addq (%rdi,%rcx,8), %rax
addq 16(%rdi,%rcx,8), %rax
popq %rbp
ret
Leh_func_end1:
I have a few questions about this assembly code:
What is the purpose of "pushq %rbp", "movq %rsp, %rbp", and "popq %rbp", if neither rbp nor rsp is used in the body of the function?
Why do rsi and rdi automatically contain the arguments to the C function (i and f, respectively) without reading them from the stack?
I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)? The leaq instruction was replaced with:
imulq $88, %rsi, %rcx

The function is simply building its own stack frame with these instructions. There's nothing really unusual about them. You should note, though, that due to this function's small size, it will probably be inlined when used in the code. The compiler is always required to produce a "normal" version of the function, though. Also, what #ouah said in his answer.
This is because that's how the AMD64 ABI specifies the arguments should be passed to functions.
If the class is INTEGER, the next available register of the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
Page 20, AMD64 ABI Draft 0.99.5 – September 3, 2010
This is not directly related to the structure size, but rather to the absolute address the function has to access. If the size of the structure is 24 bytes, f is the address of the array containing the structures, and i is the index at which the array is accessed, then the byte offset of each structure is i*24. Multiplying by 24 here is achieved by a combination of lea and SIB addressing: the lea instruction first calculates i*3, then each subsequent memory operand scales that i*3 by 8, reaching the required absolute byte offset, and uses immediate displacements to access the individual structure members ((%rdi,%rcx,8), 8(%rdi,%rcx,8), and 16(%rdi,%rcx,8)). If you make the structure 88 bytes, there is simply no way of doing such a thing swiftly with a combination of lea and any addressing mode. The compiler simply assumes that a single imulq will be more efficient in calculating i*88 than a series of shifts, adds, leas or anything else.
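As a rough C-level sketch of what those address calculations do (using the Foo from the question; the helper name and pointer gymnastics are just for illustration):
/* Hedged sketch: how the generated code reaches each member when sizeof(Foo) == 24. */
long Bar_by_hand(Foo *f, long i)
{
    char *base = (char *)f;
    long k = i + i * 2;                     /* leaq (%rsi,%rsi,2), %rcx  -> k = 3*i     */
    long x = *(long *)(base + k * 8);       /* (%rdi,%rcx,8)   -> byte offset 24*i + 0  */
    long y = *(long *)(base + k * 8 + 8);   /* 8(%rdi,%rcx,8)  -> byte offset 24*i + 8  */
    long z = *(long *)(base + k * 8 + 16);  /* 16(%rdi,%rcx,8) -> byte offset 24*i + 16 */
    return x + y + z;
}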

What is the purpose of pushq %rbp, movq %rsp, %rbp, and popq %rbp, if neither rbp nor rsp is used in the body of the function?
To keep track of stack frames when you use a debugger. Add -fomit-frame-pointer to optimize it away (note that it should be enabled at -O3, but in many GCC versions I have used it is not).
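For example, following that advice, a command along these lines should remove the frame-pointer setup from Bar (the exact output depends on the GCC version):
gcc -S -O3 -fomit-frame-pointer test.c
The pushq %rbp / movq %rsp, %rbp / popq %rbp sequence should then disappear, leaving only the leaq/movq/addq/ret body.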

3. I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)?
The leaq instruction is (essentially, and in this case) calculating k*a+b, where "k" is 1, 2, 4, or 8 and "a" and "b" are registers. If "a" and "b" are the same register, it can handle structures of 1, 2, 3, 4, 5, 8, and 9 longs.
Larger structures, such as 16 longs, may be optimizable by calculating part of the offset with "k" and then doubling, but I do not know whether that is what the compiler will actually do; you would have to test.
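If you want to experiment with "rounder" sizes, a minimal sketch (the type and field names are made up for illustration) is to pad the 88-byte struct up to a power of two and compare the generated code; indexing a 128-byte struct needs only a shift, at the cost of 40 wasted bytes per element, so whether it is a win depends on cache footprint and should be measured:
typedef struct {
    long v[11];            /* 88 bytes: f[i] needs imulq $88, %rsi, %rcx */
} Foo88;

typedef struct {
    long v[11];
    long pad[5];           /* padded to 128 bytes: the offset of f[i] is just i << 7 */
} Foo128;

long SumHead(Foo128 *f, long i)
{
    return f[i].v[0] + f[i].v[1] + f[i].v[2];
}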

Related

For GNU Assembly x64 AT&T syntax: How to add 2 quad numbers? [duplicate]

I have written an Assembly program to display the factorial of a number, following AT&T syntax. But it's not working. Here is my code:
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program prints nothing. I also used gdb to step through it: everything works fine until the loop label, but when it reaches next, random values start appearing in various registers. Help me debug it so that it prints the factorial.
As #ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient way (see footnote 1), using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_uint64 # void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do, because that's smaller than using a fast multiplicative inverse. It uses loop for the outer loop (over multiple integers, for extended precision), again favouring code size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
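A hedged C sketch of that unrolling idea (not the actual asm; it deliberately leaves the odd-digit-count / leading-zero case unhandled):
/* Store the decimal digits of val ending just before p_end, two digits per division.
   The caller must provide a large-enough buffer; a real version would trim a
   possible leading '0' when the digit count is odd. */
char *utoa_by_100(unsigned long val, char *p_end)
{
    do {
        unsigned rem = (unsigned)(val % 100);   /* 0..99: two digits of work        */
        val /= 100;                             /* one (slow) division per 2 digits */
        *--p_end = (char)('0' + rem % 10);      /* low digit                        */
        *--p_end = (char)('0' + rem / 10);      /* high digit (may be a leading 0)  */
    } while (val);
    return p_end;                               /* points at the first stored char  */
}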
Related:
NASM version of this answer, for x86-64 or i386 Linux: How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.
Several things:
0) I guess this is 64b linux environment, but you should have stated so (if it is not, some of my points will be invalid)
1) int 0x80 is 32b call, but you are using 64b registers, so you should use syscall (and different arguments)
2) int 0x80 with eax=4 requires ecx to contain the address of the memory where the content is stored, while you give it the ASCII character itself in ecx = illegal memory access (the first call should return an error, i.e. eax will hold a negative value). Running strace <your binary> should also reveal the wrong arguments and the error returned.
3) why addq $4, %rsp? That makes no sense to me: you are damaging rsp, so the next popq %rcx will pop the wrong value, and in the end you will run away "up" into the stack.
... maybe some more; I didn't debug it, this list is just from reading the source (so I may even be wrong about something, although that would be rare).
BTW, your code is working. It just doesn't do what you expected. It works fine, precisely as the CPU is designed to and precisely as you wrote it. Whether that achieves what you wanted, or makes sense, is a different topic, but don't blame the HW or the assembler.
... I can make a quick guess at how the routine may be fixed (just a partial hack-fix; it still needs a rewrite to use syscall under 64b Linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp, %rcx # make rcx point to the stack memory (with the stored char)
# this will work if you are lucky enough that rsp fits into 32b
# if it is beyond the 4GiB logical address space, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp # now rsp += 8 is needed, because there's no POP
jmp next
Again, I didn't try it myself, just writing it from my head, so let me know how it changes the situation.

How are char arrays / strings stored in binary files?

When I compile this code using different compilers and inspect the output in a hex editor, I expect to find the string "Nancy" somewhere.
#include <stdio.h>
int main()
{
char temp[6] = "Nancy";
printf("%s", temp);
return 0;
}
In the output file from gcc -o main main.c, I see "Nanc" in one place and the "y" a few bytes later, with other bytes in between.
In the output from g++ -o main main.c, I can't seem to find "Nancy" anywhere.
Compiling the same code in Visual Studio (MSVC 1929), I see the full string in a hex editor.
Why do I get some random bytes in the middle of the string in (1)?
There is no single rule about how a compiler stores data in the output files it produces.
Data can be stored in a “constant” section.
Data can be built into the “immediate” operands of instructions, in which data is encoded in various fields of the bits that encode an instruction.
Data can be computed from other data by instructions generated by the compiler.
I suspect the case where you see “Nanc” in one place and “y” in another is the compiler using a load instruction (may be written with “mov”) that loads the bytes forming “Nanc” as an immediate operand and another load instruction that loads the bytes forming “y” with a trailing null character, along with other instructions to store the loaded data on the stack and pass its address to printf.
You have not provided enough information to diagnose the g++ case: You did not name the compiler or its version number or provide any part of the generated output.
I reproduced it using gcc 9.3.0 (Linux Mint 20.2) on an x86-64 (Intel) system.
The hexdump -C output (not reproduced here) shows the same byte sequence.
So I used gcc -S -c:
.file "teststr.c"
.text
.section .rodata
.LC0:
.string "%s"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $16, %rsp
movq %fs:40, %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $1668178254, -14(%rbp) # NOTE THIS PART HERE
movw $121, -10(%rbp) # AND HERE
leaq -14(%rbp), %rax
movq %rax, %rsi
leaq .LC0(%rip), %rdi
movl $0, %eax
call printf@PLT
movl $0, %eax
movq -8(%rbp), %rdx
xorq %fs:40, %rdx
je .L3
call __stack_chk_fail@PLT
.L3:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
.section .note.GNU-stack,"",@progbits
.section .note.gnu.property,"a"
.align 8
.long 1f - 0f
.long 4f - 1f
.long 5
0:
.string "GNU"
1:
.align 8
.long 0xc0000002
.long 3f - 2f
2:
.long 0x3
3:
.align 8
4:
The highlighted value 1668178254 is hex 0x636E614E, i.e. "cnaN" in ASCII (which, because x86 is a little-endian system, is stored in memory as "Nanc"), and 121 is hex 0x79, or "y".
So, because it is a short string, the compiler uses two move instructions instead of copying it from a byte-string section of the file, and the intervening "garbage" is (I believe) the encoding of the following movw instruction. This is likely a way to optimize the initialization versus looping byte-by-byte through memory, even though no optimization flag was "officially" given to the compiler - that's the thing, the compiler can do what it wants in this regard. Microsoft's compiler, then, seems to be more "pedantic" in how it compiles, because it apparently forgoes that optimization in favor of storing the string contiguously.
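A hedged C-level paraphrase of those two stores, assuming a little-endian machine (the function name is mine; the constants are the ones from the listing):
#include <stdint.h>
#include <string.h>

void init_temp(char temp[6])
{
    uint32_t four = 0x636E614Eu;  /* movl $1668178254: bytes 'N','a','n','c' in memory */
    uint16_t two  = 0x0079u;      /* movw $121:        bytes 'y','\0'                  */
    memcpy(&temp[0], &four, 4);
    memcpy(&temp[4], &two, 2);    /* temp now holds "Nancy" plus its terminator        */
}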
Generally a compiled program is split into different types of "section". The assembler file will use directives to switch between them.
Code (".text")
Static read-only data (".section .rodata")
Initialised global or static variables (".data")
Uninitialised (or zero-initialized) global or static variables (".bss")
String literals in C can be used in two different ways.
As a pointer to constant data.
As an initialiser for an array.
If a string literal is used as a pointer then it is likely the compiler will place the string data in the read only data section.
If a string literal is used to initialise a global/static array then it is likely the compiler will place the array in the initialised data section (or the read-only data section if the array is declared as const).
However in your case the array you are initialising is an automatic local variable. So it can't be pre-initialised before program start. The compiler must include code to initialise it each time your function runs.
The compiler might choose to do that by storing the string in a read-only data location and then using a copy routine (either inlined or a call) to copy it to the local array. (In that case there will be a contiguous copy of the whole thing; otherwise there won't be.) It may choose to simply generate instructions to set the elements of the array one by one. Or it may choose to generate instructions that set several array elements at a time (e.g. 4 bytes and then 2 bytes, including the terminating '\0').
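A small sketch of those cases (the identifiers are illustrative); feeding it to gcc -S shows which section each string ends up in:
const char *msg_ptr = "Nancy";       /* literal goes in .rodata; only the pointer is data  */
char global_arr[] = "Nancy";         /* array in .data, pre-initialised in the binary      */
const char const_arr[] = "Nancy";    /* array placed in .rodata                            */

void demo(void)
{
    char local[6] = "Nancy";         /* automatic array: initialised by code at run time,  */
                                     /* e.g. a copy from .rodata or a few immediate stores */
    (void)local;
}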
P.S. I've noticed some people posting https://godbolt.org/ links on other answers to this question. The Compiler Explorer is a useful tool, but be aware that it hides the section-switching directives from the assembler output by default.

Why is there a number 22 in GCC's implementation of a VLA (variable-length array)?

int read_val();
long read_and_process(int n) {
long vals[n];
for (int i = 0; i < n; i++)
vals[i] = read_val();
return vals[n-1];
}
The assembly language code compiled by x86-64 GCC 5.4 is:
read_and_process(int):
pushq %rbp
movslq %edi, %rax
>>> leaq 22(,%rax,8), %rax
movq %rsp, %rbp
pushq %r14
pushq %r13
pushq %r12
pushq %rbx
andq $-16, %rax
leal -1(%rdi), %r13d
subq %rax, %rsp
testl %edi, %edi
movq %rsp, %r14
jle .L3
leal -1(%rdi), %eax
movq %rsp, %rbx
leaq 8(%rsp,%rax,8), %r12
movq %rax, %r13
.L4:
call read_val()
cltq
addq $8, %rbx
movq %rax, -8(%rbx)
cmpq %r12, %rbx
jne .L4
.L3:
movslq %r13d, %r13
movq (%r14,%r13,8), %rax
leaq -32(%rbp), %rsp
popq %rbx
popq %r12
popq %r13
popq %r14
popq %rbp
ret
Why is there a need to calculate 8*%rax+22 and then AND with -16, since there could be 8*%rax+16, which gives the same result and looks more natural?
Other assembly language code compiled by x86-64 GCC 11.2 looks almost the same, with the number 22 replaced by 15. So is the number chosen at random, or is there some reason for it?
Summary: The number is not random, it's part of a calculation that ensures proper stack alignment. The number should be 15, and the 22 is the result of a minor bug in old versions of GCC.
Recall that the x86-64 SysV ABI mandates 16-byte stack alignment; the stack pointer must be a multiple of 16 prior to any call instruction. Hence when we enter read_and_process, the stack pointer is 8 less than a multiple of 16, because the call that got us here pushed 8 bytes. So prior to calling read_val(), the stack pointer must be decremented by 8 more than a multiple of 16, i.e. an odd multiple of 8. The prologue pushes an odd number of registers (five, namely rbp, r14, r13, r12, rbx) at 8 bytes each. So the remaining stack adjustment must be a multiple of 16.
So whatever amount of memory is to be allocated for the array vals, it must be rounded up to a multiple of 16. A standard way to do that is to add 15, then AND with -16: adjusted = (orig + 15) & -16.
Why does that work? -16, thanks to two's complement arithmetic, has the low 4 bits clear and the others set, so AND with -16 results in a multiple of 16 - but since the AND clears low-order bits, the result of x & -16 is less than x; this is rounding down. If we add 15 first (which is, of course, 1 less than 16), the net effect is to round up instead. Adding 15 to orig will cause it to pass a multiple of 16, and then & -16 will round down to that multiple of 16. Unless orig was already a multiple of 16, in which case orig+15 rounds down back to orig itself. So this does the right thing in all cases.
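In C, the same round-up-to-16 idiom looks like this (a minimal sketch; the function name is mine):
#include <stdint.h>

/* Round size up to the next multiple of 16 (works because 16 is a power of two). */
static inline uint64_t align_up_16(uint64_t size)
{
    return (size + 15) & (uint64_t)-16;   /* -16 clears the low four bits */
}
/* align_up_16(32) == 32, align_up_16(33) == 48, align_up_16(40) == 48 */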
That's what GCC does from 8.1.0 onward. Adding 15 is baked into the same lea that multiplies n by 8, and the AND with -16 comes several lines later.
In this case, since orig = 8*n is already a multiple of 8, there are other values besides 15 that would work just as well; 8, for instance (though not 16, see below). But using 15 is completely equivalent mathematically and in terms of code size and speed, and since 15 works regardless of previous alignment, the compiler authors can just use 15 unconditionally without writing extra code to keep track of what alignment orig may already have.
But adding 22 instead, as older GCC does, is clearly wrong. If orig was already a multiple of 16, say orig = 32, then orig+22 is 54, which rounds down to 48. But 32 bytes was already a perfectly good size so we just wasted 16 bytes for no reason. (Here orig is 8*n so this would occur if the input n is even.) For similar reasons, your suggestion of using 16 instead of 22 would also be wrong.
So the 22 is a bug. It's a pretty minor bug; the resulting code still works just fine and complies with the ABI, and the only ill effect is that sometimes a little bit of stack space is wasted. But it was fixed for GCC 8.1.0 by a commit entitled "Improve alloca alignment". (alloca is an old non-standard function that does dynamic stack allocation, and compiler writers often use the term to refer to any stack allocation.)
Apparently the issue was that some previous pass of the compiler had determined that the size needed to be aligned to (at least) 8 bytes, which would have been accomplished by adding 7 and ANDing with -8 (which might later be optimized away when the compiler realized later that n*8 is already aligned to 8 bytes). Now this constraint should be made redundant when the compiler realizes that 16-byte alignment is actually required, as every multiple of 16 is already a multiple of 8. But the compiler erroneously adds the offsets 7 and 15, when the right thing to do was to take their maximum (which is what the commit implemented). And 7 + 15 is... 22.
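A small check of that arithmetic (a sketch; the helper names are mine):
#include <assert.h>

/* Two alignment requirements (8, then 16) should combine by taking the stricter one,
   i.e. a single "+ 15", not "+ 7 + 15" = "+ 22". */
static unsigned long pad15(unsigned long orig) { return (orig + 15) & -16ul; }
static unsigned long pad22(unsigned long orig) { return (orig + 22) & -16ul; }

int main(void)
{
    assert(pad15(32) == 32);                       /* already 16-aligned: nothing added    */
    assert(pad22(32) == 48);                       /* GCC 5.4's formula wastes 16 bytes    */
    assert(pad15(24) == 32 && pad22(24) == 32);    /* for other sizes they happen to agree */
    return 0;
}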
If you compile the code using GCC 5.4 with optimizations off, you can see these two operations happening separately:
lea rdx, [rax+7] ; add 7 to rax and write to rdx
mov eax, 16
sub rax, 1 ; now rax = 15
add rax, rdx ; rax = 15 + rdx = orig + 22
and with optimizations on, the optimizer combines these into a single add of 22 - without noticing that the add of 7 should not have been there to begin with. In newer versions of GCC with -O0, the lea rdx, [rax+7] is gone.
Why is there a need to calculate 8*%rax+22 and then AND with -16, since there could be 8*%rax+16, which gives the same result and looks more natural?
It does not give the same result. The expression ( ( rax*8 + 22 ) & -16 ) aligns the result to 16 bytes.
On 64-bit CPUs, -16 is equivalent to 0xFFFFFFFFFFFFFFF0. Written that way, it's obvious what the AND instruction is doing: it strips the four least significant bits from the value, which makes the result aligned to 16 bytes, rounding down. The ( ( rax*8 + 15 ) & -16 ) expression gives alignment to 16 bytes, rounding up. But the compiler wants eight more bytes of adjustment, because it pushed five values onto the stack with five push instructions, and each one is eight bytes.
Your next question is probably going to be “why align by 16 bytes when alignof(long)=8?”
The answer is the preferred-stack-boundary compiler option. The option defaults to 4 in GCC, which means the compiler aligns stack frames by 2^4 = 16 bytes.
Try to compile the same code with -mpreferred-stack-boundary=3 (which, BTW, is the minimum allowed value for AMD64. It requires the alignment to be at least one pointer in size) and see what happens to the assembly.

64-bit GCC mixing 32-bit and 64-bit pointers

Although the code works, I'm baffled by the compiler's decision to seemingly mix 32 and 64 bit parameters of the same type. Specifically, I have a function which receives three char pointers. Looking at the assembly code, two of the three are passed as 64-bit pointers (as expected), while the third, a local constant, but character string nonetheless, is being passed as a 32-bit pointer. I don't see how my function could ever know when the 3rd parameter isn't a fully loaded 64-bit pointer. Obviously it doesn't matter as long as the higher side is 0, but I don't see it making an effort to ensure that. Anything could be in the high side of RDX in this example. What am I missing? BTW, the receiving function assumes it's a full 64-bit pointer and includes this code on entry:
movq %rdx, -24(%rbp)
This is the code in question:
.LC4:
.string "My Silly String"
.text
.globl funky_funk
.type funky_funk, @function
funky_funk:
pushq %rbp
movq %rsp, %rbp
pushq %rbx
subq $16, %rsp
movq %rdi, -16(%rbp) ;char *dst 64-bit
movl %esi, -20(%rbp) ;int len, 32 bits OK
movl $.LC4, %edx ;<<<<---- why is it not RDX?
movl -20(%rbp), %ecx ;int len 32-bits OK
movq -16(%rbp), %rbx ;char *dst 64-bit
movq -16(%rbp), %rax ;char *dst 64-bit
movq %rbx, %rsi ;char *dst 64-bit
movq %rax, %rdi ;char *dst 64-bit
call edc_function
void funky_funk(char *dst, int len)
{ //how will function know when
edc_function(dst, dst, STRING_LC4, len); //a str passed in 3rd parm
} //is 32-bit ptr vs 64-bit ptr?
void edc_function(char *dst, char *src, char *key, int len)
{
//so, is key a 32-bit ptr? or is key a 64-bit ptr?
}
When a 32-bit value is loaded into a register, it is zero-extended into the full 64-bit register. You are probably compiling in a mode where the compiler knows the code and its symbols live in the lower 32-bit-addressable part of memory.
GCC has several memory models for x64, two of which have that property. From GCC documentation:
`-mcmodel=small'
Generate code for the small code model: the program and its
symbols must be linked in the lower 2 GB of the address space.
Pointers are 64 bits. Programs can be statically or dynamically
linked. This is the default code model.
`-mcmodel=medium'
Generate code for the medium model: The program is linked in the
lower 2 GB of the address space. Small symbols are also placed
there. Symbols with sizes larger than `-mlarge-data-threshold'
are put into large data or bss sections and can be located above
2GB. Programs can be statically or dynamically linked.
(the other ones are kernel, which is similar to small but in the upper/negative 2GB of
address space and large with no restriction).
Adding this as an answer, as it contains "part of the puzzle" for the original question:
As long as the compiler can determine [for example, by specifying a memory model that satisfies this] that .LC4 is within the first 4GB, it can do this. %edx will be loaded with the low 32 bits of the address of .LC4, and the upper bits will be set to zero, so when edc_function() is called it can use the full 64 bits of %rdx, and as long as the address is within the lower 4GB, it will work out fine.
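A tiny C illustration of the zero-extension point (the constants are arbitrary):
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t rdx = 0xDEADBEEFCAFEF00Dull;        /* pretend the register held garbage       */
    rdx = (uint32_t)0x00401234;                  /* like movl $imm32, %edx: bits 63:32 -> 0 */
    printf("%#llx\n", (unsigned long long)rdx);  /* prints 0x401234                         */
    return 0;
}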

c & gcc : Stack growth and alignment - for a 64 bit machine

I have the following program. I wonder why it outputs -4 on the following 64-bit machine. Which of my assumptions went wrong?
[Linux ubuntu 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux]
On the above machine with the gcc compiler, by default b should be pushed first and a second.
The stack grows downwards, so b should have the higher address and a the lower one, and the result should be positive. But I got -4. Can anybody explain this?
The arguments are two chars occupying 2 bytes in the stack frame, but I see a difference of 4 where I expect 1. Even if somebody says it is because of alignment, I would point out that a structure with 2 chars is not aligned to 4 bytes.
#include <stdio.h>
#include <stdint.h> /* for intptr_t */
#include <stdlib.h>
#include <unistd.h>
void CompareAddress(char a, char b)
{
printf("Differs=%ld\n", (intptr_t )&b - (intptr_t )&a);
}
int main()
{
CompareAddress('a','b');
return 0;
}
/* Differs= -4 */
Here's my guess:
On Linux in x64, the calling convention states that the first few parameters are passed by register.
So in your case, both a and b are passed in registers rather than on the stack. However, since you take their addresses, the compiler will store them somewhere on the stack after the function is called (not necessarily in any particular downwards order).
It's also possible that the function is just outright inlined.
In either case, the compiler makes temporary stack space to store the variables. Those can be in any order and subject to optimizations. So they may not be in any particular order that you might expect.
The best way to answer this sort of question (about the behaviour of a specific compiler on a specific platform) is to look at the assembler. You can get gcc to dump its assembler by passing the -S flag (and the -fverbose-asm flag is nice too). Running
gcc -S -fverbose-asm file.c
gives a file.s that looks a little like this (I've removed all the irrelevant bits, and the bits in parentheses are my notes):
CompareAddress:
# ("allocate" memory on the stack for local variables)
subq $16, %rsp
# (put a and b onto the stack)
movl %edi, %edx # a, tmp62
movl %esi, %eax # b, tmp63
movb %dl, -4(%rbp) # tmp62, a
movb %al, -8(%rbp) # tmp63, b
# (get their addresses)
leaq -8(%rbp), %rdx #, b.0
leaq -4(%rbp), %rax #, a.1
subq %rax, %rdx # a.1, D.4597 (&b - &a)
# (set up the parameters for the printf call)
movl $.LC0, %eax #, D.4598
movq %rdx, %rsi # D.4597,
movq %rax, %rdi # D.4598,
movl $0, %eax #,
call printf #
main:
# (put 'a' and 'b' into the registers for the function call)
movl $98, %esi #,
movl $97, %edi #,
call CompareAddress
(This question explains nicely what [re]bp and [re]sp are.)
The reason the difference is negative is the stack grows downward: i.e. if you push two things onto the stack, the one you push first will have a larger address, and a is pushed before b.
The reason it is -4 rather than -1 is that the compiler has decided that aligning the arguments to 4-byte boundaries is "better", probably because a 32-bit/64-bit CPU deals with 4 bytes at a time better than it handles single bytes.
(Also, looking at the assembler shows the effect that -mpreferred-stack-boundary has: it essentially means that memory on the stack is allocated in different sized chunks.)
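As for the question's side note about a struct of two chars: a quick check (a minimal sketch) confirms that such a struct has size 2 and alignment 1, so the 4-byte spacing between the spilled arguments is a code-generation choice, not a type-alignment requirement:
#include <stdio.h>

struct TwoChars { char a, b; };

int main(void)
{
    printf("size=%zu align=%zu\n",
           sizeof(struct TwoChars), _Alignof(struct TwoChars));  /* prints: size=2 align=1 */
    return 0;
}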
I think the answer the program gave you is correct. The default preferred-stack-boundary of GCC is 4; you can pass -mpreferred-stack-boundary=num to GCC to change the stack boundary, and the program will then give you a different answer according to your setting.

Resources