c & gcc : Stack growth and alignment - for a 64 bit machine

c & gcc : Stack growth and alignment - for a 64 bit machine - c

I have the following program. I wonder why it outputs -4 on the following 64 bit machine? Which of my assumptions went wrong ?
[Linux ubuntu 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC
2012 x86_64 x86_64 x86_64 GNU/Linux]
In the above machine and gcc compiler, by default b should be pushed first and a second.
The stack grows downwards. So b should have higher address and a have lower address. So result should be positive. But I got -4. Can anybody explain this ?
The arguments are two chars occupying 2 bytes in the stack frame. But I saw the difference as 4 where as I am expecting 1. Even if somebody says it is because of alignment, then I am wondering a structure with 2 chars is not aligned at 4 bytes.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
void CompareAddress(char a, char b)
{
printf("Differs=%ld\n", (intptr_t )&b - (intptr_t )&a);
}
int main()
{
CompareAddress('a','b');
return 0;
}
/* Differs= -4 */

Here's my guess:
On Linux in x64, the calling convention states that the first few parameters are passed by register.
So in your case, both a and b are passed by register rather than on the stack. However, since you take its address, the compiler will store it somewhere on the stack after the function is called.(Not necessary in the downwards order.)
It's also possible that the function is just outright inlined.
In either case, the compiler makes temporary stack space to store the variables. Those can be in any order and subject to optimizations. So they may not be in any particular order that you might expect.

The best way to answer these sort of questions (about behaviour of a specific compiler on a specific platform) is to look at the assembler. You can get gcc to dump its assembler by passing the -S flag (and the -fverbose-asm flag is nice too). Running
gcc -S -fverbose-asm file.c
gives a file.s that looks a little like (I've removed all the irrelevant bits, and the bits in parenthesis are my notes):
CompareAddress:
# ("allocate" memory on the stack for local variables)
subq $16, %rsp
# (put a and b onto the stack)
movl %edi, %edx # a, tmp62
movl %esi, %eax # b, tmp63
movb %dl, -4(%rbp) # tmp62, a
movb %al, -8(%rbp) # tmp63, b
# (get their addresses)
leaq -8(%rbp), %rdx #, b.0
leaq -4(%rbp), %rax #, a.1
subq %rax, %rdx # a.1, D.4597 (&b - &a)
# (set up the parameters for the printf call)
movl $.LC0, %eax #, D.4598
movq %rdx, %rsi # D.4597,
movq %rax, %rdi # D.4598,
movl $0, %eax #,
call printf #
main:
# (put 'a' and 'b' into the registers for the function call)
movl $98, %esi #,
movl $97, %edi #,
call CompareAddress
(This question explains nicely what [re]bp and [re]sp are.)
The reason the difference is negative is the stack grows downward: i.e. if you push two things onto the stack, the one you push first will have a larger address, and a is pushed before b.
The reason it is -4 rather than -1 is the compiler has decided that aligning the arguments to 4 byte boundaries is "better", probably because a 32 bit/64 bit CPU deals with 4 bytes at time better than it handles single bytes.
(Also, looking at the assembler shows the effect that -mpreferred-stack-boundary has: it essentially means that memory on the stack is allocated in different sized chunks.)

I think the answer that program given you is correct, the default preferred-stack-boundary of GCC is 4, you can set -mpreferred-stack-boundary=num to GCC options to change the stack boudary, then program will give you the different answer according your set.

Related

For GNU Assembly x64 AT&T syntax: How to add 2 quad numbers? [duplicate]

I have written a Assembly program to display the factorial of a number following AT&T syntax. But it's not working. Here is my code
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program is printing nothing, I also used gdb to visualize it work fine until loop function but when it comes in next some random value start entering in various register. Help me to debug so that it could print factorial.

As #ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x8-64 Linux, the simple and somewhat-efficient1 way, using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_integer #void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do, because that's smaller than using a fast multiplicative inverse). It uses loop for the outer loop (over multiple integer for extended precision), again for code-size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
Related:
NASM version of this answer, for x86-64 or i386 Linux How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.

Several things:
0) I guess this is 64b linux environment, but you should have stated so (if it is not, some of my points will be invalid)
1) int 0x80 is 32b call, but you are using 64b registers, so you should use syscall (and different arguments)
2) int 0x80, eax=4 requires the ecx to contain address of memory, where the content is stored, while you give it the ASCII character in ecx = illegal memory access (the first call should return error, i.e. eax is negative value). Or using strace <your binary> should reveal the wrong arguments + error returned.
3) why addq $4, %rsp? Makes no sense to me, you are damaging rsp, so the next pop rcx will pop wrong value, and in the end you will run way "up" into the stack.
... maybe some more, I didn't debug it, this list is just by reading the source (so I may be even wrong about something, although that would be rare).
BTW your code is working. It just doesn't do what you expected. But work fine, precisely as the CPU is designed and precisely what you wrote in the code. Whether that does achieve what you wanted, or makes sense, that's different topic, but don't blame the HW or assembler.
... I can do a quick guess how the routine may be fixed (just partial hack-fix, still needs rewrite for syscall under 64b linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp,%rcx ; make ecx to point to stack memory (with stored char)
; this will work if you are lucky enough that rsp fits into 32b
; if it is beyond 4GiB logical address, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp ; now rsp += 8; is needed, because there's no POP
jmp next
Again didn't try myself, just writing it from head, so let me know how it changed situation.

Segmentation fault when calling x86 Assembly function from C program

I am writing a C program that calls an x86 Assembly function which adds two numbers. Below are the contents of my C program (CallAssemblyFromC.c):
#include <stdio.h>
#include <stdlib.h>
int addition(int a, int b);
int main(void) {
int sum = addition(3, 4);
printf("%d", sum);
return EXIT_SUCCESS;
}
Below is the code of the Assembly function (my idea is to code from scratch the stack frame prologue and epilogue, I have added comments to explain the logic of my code) (addition.s):
.text
# Here, we define a function addition
.global addition
addition:
# Prologue:
# Push the current EBP (base pointer) to the stack, so that we
# can reset the EBP to its original state after the function's
# execution
push %ebp
# Move the EBP (base pointer) to the current position of the ESP
# register
movl %esp, %ebp
# Read in the parameters of the addition function
# addition(a, b)
#
# Since we are pushing to the stack, we need to obtain the parameters
# in reverse order:
# EBP (return address) | EBP + 4 (return value) | EBP + 8 (b) | EBP + 4 (a)
#
# Utilize advanced indexing in order to obtain the parameters, and
# store them in the CPU's registers
movzbl 8(%ebp), %ebx
movzbl 12(%ebp), %ecx
# Clear the EAX register to store the sum
xorl %eax, %eax
# Add the values into the section of memory storing the return value
addl %ebx, %eax
addl %ecx, %eax
I am getting a segmentation fault error, which seems strange considering that I think I am allocating memory in accordance with the x86 calling conventions (e.x. allocating the correct memory sections to the function's parameters). Furthermore, if any of you have a solution, it would be greatly appreciated if you could provide some advice as to how to debug an Assembly program embedded with C (I have been using the GDB debugger but it simply points to the line of the C program where the segmentation fault happens instead of the line in the Assembly program).

Your function has no epilogue. You need to restore %ebp and pop the stack back to where it was, and then ret. If that's really missing from your code, then that explains your segfault: the CPU will go on executing whatever garbage happens to be after the end of your code in memory.
You clobber (i.e. overwrite) the %ebx register which is supposed to be callee-saved. (You mention following the x86 calling conventions, but you seem to have missed that detail.) That would be the cause of your next segfault, after you fixed the first one. If you use %ebx, you need to save and restore it, e.g. with push %ebx after your prologue and pop %ebx before your epilogue. But in this case it is better to rewrite your code so as not to use it at all; see below.
movzbl loads an 8-bit value from memory and zero-extends it into a 32-bit register. Here the parameters are int so they are already 32 bits, so plain movl is correct. As it stands your function would give incorrect results for any arguments which are negative or larger than 255.
You're using an unnecessary number of registers. You could move the first operand for the addition directly into %eax rather than putting it into %ebx and adding it to zero. And on x86 it is not necessary to get both operands into registers before adding; arithmetic instructions have a mem, reg form where one operand can be loaded directly from memory. With this approach we don't need any registers other than %eax itself, and in particular we don't have to worry about %ebx anymore.
I would write:
.text
# Here, we define a function addition
.global addition
addition:
# Prologue:
push %ebp
movl %esp, %ebp
# load first argument
movl 8(%ebp), %eax
# add second argument
addl 12(%ebp), %eax
# epilogue
movl %ebp, %esp # redundant since we haven't touched esp, but will be needed in more complex functions
pop %ebp
ret
In fact, you don't need a stack frame for this function at all, though I understand if you want to include it for educational value. But if you omit it, the function can be reduced to
.text
.global addition
addition:
movl 4(%esp), %eax
addl 8(%esp), %eax
ret

You are corrupting the stacke here:
movb %al, 4(%ebp)
To return the value, simply put it in eax. Also why do you need to clear eax? that's inefficient as you can load the first value directly into eax and then add to it.
Also EBX must be saved if you intend to use it, but you don't really need it anyway.

Variable types in C and who keeps track of it

I am taking a MOOC course CS50 from Harvard. In one of the first lectures we learned about variables of different data types: int,char, etc.
What I understand is that command (say, within main function) int a = 5 reserves a number of bytes (4 for the most part) of memory on the stack and puts there a sequence of zeros and ones which represent 5.
The same sequence of zeros and ones also could mean a certain character. So somebody needs to keep track of the fact that the sequence of zeros and ones in the memory place reserved for a is to be read as an integer (and not as a character).
The question is who does keep track of it? The computer's memory by sticking a tag to this place in memory saying "hey, whatever you find in these 4 bytes read as an integer"? Or the C compiler, which knows (looking at the type int of a) that when my code asks it to do something (more precisely, to produce a machine code doing something) with the value of a it needs to treat this value as an integer?
I would really appreciate an answer tailored to a C beginner.

With the C language, it's the compiler.
At run-time, there's only the 32 bits = 4 bytes on the stack.
You ask "The computer's memory by sticking a tag to this place...": that's impossible (with current computer architectures - thanks for the hint from #Ivan). The memory itself is just 8 bits (being 0 or 1) ber byte. There is no place in memory that can tag a memory cell with whatever additional info.
There are other languages (e.g. LISP, and to some degree also Java and C#) that store an integer as a combination of the 32 bits for the number plus a few bits or bytes that contain some bit-encoded tagging that here we have an integer. So they need e.g. 6 bytes for a 32-bit integer. But with C, that's not the case. You need knowledge from the source code to correctly interpret the bits found in memory - they don't explain themselves. And there have been special architectures that supported tagging in hardware.

In C, memory is untyped; no information beyond its value is stored there. All type information is computed at compile time from the type of an expression (a variable name, a value computation, a pointer dereferencing etc.) This computation depends on the information the programmer provides through declarations (also in headers) or casts. If that information is wrong, e.g. because a function prototype's parameters are declared wrong, all bets are off. The compiler warns about or prevents mis-declarations in the same "translation unit" (file with headers), but between translation units there are no (or not many?) protections. That's one reason why C has headers: They share common type information between translation units.
C++ keeps this idea but additionally offers run time type information (as opposed to compile time type information) for polymorphic types. It's obvious that every polymorphic object must carry extra information somewhere (not necessarily close to the data though). But that is C++, not C.

For the main part it's the C compiler that keeps track.
During the compilation process the compiler builds up a large data structure called the parse tree. It also keeps track of all variables, functions, types, ... everything with a name (i.e. identifier); this is called the symbol table.
The nodes of both the parse tree and the symbol table have an entry in which the type is recorded. They keep track of all the types.
With mainly these two data structures in hand, the compiler can check if your code does not violate type rules. It allows the compiler to warn you if you use incompatible values or variable names.
C does allow implicit conversation between types. You can for example assign an int to a double. But in memory these are completely different bit patterns for the same value.
In earlier (higher abstraction level) phases of the compilation process, the compiler does not deal with bit patterns yet (or too much), and makes conversions and checks at a higher level.
But during the assembly code generating process, the compiler needs to finally figure it all out in bits. So for an int to double conversion:
int i = 5;
double d = i; // Conversion.
the compiler will generate code to make this conversion happen.
In C however it's very easy to make mistakes and mess things up. This is because C is not a very strongly typed language and is rather flexible. So a programmer also needs to be aware.
Because C does not keep track of types anymore after compilation, so when program is run, a program can often silently continue running with the wrong data after executing some of your mistakes. And if you're 'lucky' that the program crashes, the error message you is not (very) informative.

You have a stack pointer which gives an absolute offset for the topmost stack frame in memory.
For a given scope of execution, the compiler knows which variable is located relative to this stack pointer and emits access to these variables as on offset to the stack pointer. So it is primarily the compiler mapping the variables, but it's the processor which is applying this mapping.
You can easily write programs which compute or remember a memory address which used to be valid, or is just outside of a valid region. The compiler doesn't stop you from doing so, only higher level languages with reference counting and strict boundary checks do at runtime.

The compiler keeps track of all type information during translation, and it will generate the proper machine code to deal with data of different types or sizes.
Let's take the following code:
#include <stdio.h>
int main( void )
{
long long x, y, z;
x = 5;
y = 6;
z = x + y;
printf( "x = %ld, y = %ld, z = %ld\n", x, y, z );
return 0;
}
After running that through gcc -S, the assignment, addition, and print statements are translated to:
movq $5, -24(%rbp)
movq $6, -16(%rbp)
movq -16(%rbp), %rax
addq -24(%rbp), %rax
movq %rax, -8(%rbp)
movq -8(%rbp), %rcx
movq -16(%rbp), %rdx
movq -24(%rbp), %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
movq is the mnemonic for moving values into 64-bit words ("quadwords"). %rax is a general-purpose 64-bit register that's being used as an accumulator. Don't worry too much about the rest of it for now.
Now let's see what happens when we change those longs to shorts:
#include <stdio.h>
int main( void )
{
short x, y, z;
x = 5;
y = 6;
z = x + y;
printf( "x = %hd, y = %hd, z = %hd\n", x, y, z );
return 0;
}
Again, we run it through gcc -S to generate the machine code, et voila:
movw $5, -6(%rbp)
movw $6, -4(%rbp)
movzwl -6(%rbp), %edx
movzwl -4(%rbp), %eax
leal (%rdx,%rax), %eax
movw %ax, -2(%rbp)
movswl -2(%rbp),%ecx
movswl -4(%rbp),%edx
movswl -6(%rbp),%esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
Different mnemonics - instead of movq we get movw and movswl, we're using %eax, which is the lower 32 bits of %rax, etc.
Once more, this time with floating-point types:
#include <stdio.h>
int main( void )
{
double x, y, z;
x = 5;
y = 6;
z = x + y;
printf( "x = %f, y = %f, z = %f\n", x, y, z );
return 0;
}
gcc -S again:
movabsq $4617315517961601024, %rax
movq %rax, -24(%rbp)
movabsq $4618441417868443648, %rax
movq %rax, -16(%rbp)
movsd -24(%rbp), %xmm0
addsd -16(%rbp), %xmm0
movsd %xmm0, -8(%rbp)
movq -8(%rbp), %rax
movq -16(%rbp), %rdx
movq -24(%rbp), %rcx
movq %rax, -40(%rbp)
movsd -40(%rbp), %xmm2
movq %rdx, -40(%rbp)
movsd -40(%rbp), %xmm1
movq %rcx, -40(%rbp)
movsd -40(%rbp), %xmm0
movl $.LC2, %edi
movl $3, %eax
call printf
movl $0, %eax
leave
ret
New mnemonics (movsd), new registers (%xmm0).
So basically, after translation, there's no need to tag the data with type information; that type information is "baked in" to the machine code itself.

x86 calling convention: should arguments passed by stack be read-only?

It seems state-of-art compilers treat arguments passed by stack as read-only. Note that in the x86 calling convention, the caller pushes arguments onto the stack and the callee uses the arguments in the stack. For example, the following C code:
extern int goo(int *x);
int foo(int x, int y) {
goo(&x);
return x;
}
is compiled by clang -O3 -c g.c -S -m32 in OS X 10.10 into:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## #foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
movl %eax, -4(%ebp)
leal -4(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl -4(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
Here, the parameter x(8(%ebp)) is first loaded into %eax; and then stored in -4(%ebp); and the address -4(%ebp) is stored in %eax; and %eax is passed to the function goo.
I wonder why Clang generates code that copy the value stored in 8(%ebp) to -4(%ebp), rather than just passing the address 8(%ebp) to the function goo. It would save memory operations and result in a better performance. I observed a similar behaviour in GCC too (under OS X). To be more specific, I wonder why compilers do not generate:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## #foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl 8(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
I searched for documents if the x86 calling convention demands the passed arguments to be read-only, but I couldn't find anything on the issue. Does anybody have any thought on this issue?

The rules for C are that parameters must be passed by value. A compiler converts from one language (with one set of rules) to a different language (potentially with a completely different set of rules). The only limitation is that the behaviour remains the same. The rules of the C language do not apply to the target language (e.g. assembly).
What this means is that if a compiler feels like generating assembly language where parameters are passed by reference and are not passed by value; then this is perfectly legal (as long as the behaviour remains the same).
The real limitation has nothing to do with C at all. The real limitation is linking. So that different object files can be linked together, standards are needed to ensure that whatever the caller in one object file expects matches whatever the callee in another object file provides. This is what's known as the ABI. In some cases (e.g. 64-bit 80x86) there are multiple different ABIs for the exact same architecture.
You can even invent your own ABI that's radically different (and implement your own tools that support your own radically different ABI) and that's perfectly legal as far as the C standards go; even if your ABI requires "pass by reference" for everything (as long as the behaviour remains the same).

Actually, I just compiled this function using GCC:
int foo(int x)
{
goo(&x);
return x;
}
And it generated this code:
_foo:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
call _goo
movl 8(%ebp), %eax
leave
ret
This is using GCC 4.9.2 (on 32-bit cygwin if it matters), no optimizations. So in fact, GCC did exactly what you thought it should do and used the argument directly from where the caller pushed it on the stack.

The C programming language mandates that arguments are passed by value. So any modification of an argument (like an x++; as the first statement of your foo) is local to the function and does not propagate to the caller.
Hence, a general calling convention should require copying of arguments at every call site. Calling conventions should be general enough for unknown calls, e.g. thru a function pointer!
Of course, if you pass an address to some memory zone, the called function is free to dereference that pointer, e.g. as in
int goo(int *x) {
static int count;
*x = count++;
return count % 3;
}
BTW, you might use link-time optimizations (by compiling and linking with clang -flto -O2 or gcc -flto -O2) to perhaps enable the compiler to improve or inline some calls between translation units.
Notice that both Clang/LLVM and GCC are free software compilers. Feel free to propose an improvement patch to them if you want to (but since both are very complex pieces of software, you'll need to work some months to make that patch).
NB. When looking into produced assembly code, pass -fverbose-asm to your compiler!

About gcc-compiled x86_64 code and C code optimization

I compiled the following C code:
typedef struct {
long x, y, z;
} Foo;
long Bar(Foo *f, long i)
{
return f[i].x + f[i].y + f[i].z;
}
with the command gcc -S -O3 test.c. Here is the Bar function in the output:
.section __TEXT,__text,regular,pure_instructions
.globl _Bar
.align 4, 0x90
_Bar:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
leaq (%rsi,%rsi,2), %rcx
movq 8(%rdi,%rcx,8), %rax
addq (%rdi,%rcx,8), %rax
addq 16(%rdi,%rcx,8), %rax
popq %rbp
ret
Leh_func_end1:
I have a few questions about this assembly code:
What is the purpose of "pushq %rbp", "movq %rsp, %rbp", and "popq %rbp", if neither rbp nor rsp is used in the body of the function?
Why do rsi and rdi automatically contain the arguments to the C function (i and f, respectively) without reading them from the stack?
I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)? The leaq instruction was replaced with:
imulq $88, %rsi, %rcx

The function is simply building its own stack frame with these instructions. There's nothing really unusual about them. You should note, though, that due to this function's small size, it will probably be inlined when used in the code. The compiler is always required to produce a "normal" version of the function, though. Also, what #ouah said in his answer.
This is because that's how the AMD64 ABI specifies the arguments should be passed to functions.
If the class is INTEGER, the next available register of the sequence
%rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.
Page 20, AMD64 ABI Draft 0.99.5 – September 3, 2010
This is not directly related to the structure size, rather - the absolute address that the function has to access. If the size of the structure is 24 bytes, f is the address of the array containing the structures, and i is the index at which the array has to be accessed, then the byte offset to each structure is i*24. Multiplying by 24 in this case is achieved by a combination of lea and SIB addressing. The first lea instruction simply calculates i*3, then every subsequent instruction uses that i*3 and multiplies it further by 8, therefore accessing the array at the needed absolute byte offset, and then using immediate displacements to access the individual structure members ((%rdi,%rcx,8). 8(%rdi,%rcx,8), and 16(%rdi,%rcx,8)). If you make the size of the structure 88 bytes, there is simply no way of doing such a thing swiftly with a combination of lea and any kind of addressing. The compiler simply assumes that a simple imull will be more efficient in calculating i*88 than a series of shifts, adds, leas or anything else.

What is the purpose of pushq %rbp, movq %rsp, %rbp, and popq %rbp, if neither rbp nor rsp is used in the body of the function?
To keep track of the frames when you use a debugger. Add -fomit-frame-pointer to optimize (note that it should be enabled at -O3 but in a lot of gcc versions I used it is not).

3. I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)?
The leaq call is (essentially and in this cae) calculating k*a+b where "k" is 1, 2, 4, or 8 and "a" and "b" are registers. If "a" and "b" are the same, it can be used for structures of 1, 2, 3, 4, 5, 8, and 9 longs.
Larger structures like 16 longs may be optimizable by calculating the offset with for "k" and doubling, but I do not know if that is what the compiler will actually do; you would have to test.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight