In the C language, why does n++ execute faster than n=n+1?
(int n=...; n++;)
(int n=...; n=n+1;)
Our instructor asked that question in today's class. (this is not homework)
That would only be true if you were working with a "stone-age" compiler...
In the "stone-age" case:
++n is faster than n++, which is faster than n=n+1.
The machine usually has both an "increment x" instruction and an "add constant to x" instruction.
For n++, you have only 2 memory accesses (read n, increment it, write n).
For n=n+1, you have 3 memory accesses (read n, read the constant, add them, write n).
But today's compilers will automatically convert n=n+1 into ++n, and they will do more than you might imagine!
Also, on today's out-of-order processors, even with a "stone-age" compiler, runtime may not be affected at all in many cases.
On GCC 4.4.3 for x86, with or without optimizations, they compile to the exact same assembly code, and thus take the same amount of time to execute. As you can see in the assembly, GCC simply converts n++ into n=n+1, then optimizes it into the one-instruction add (with -O2).
Your instructor's suggestion that n++ is faster only applies to very old, non-optimizing compilers, which were not smart enough to select the in-place update instructions for n = n + 1. These compilers have been obsolete in the PC world for years, but may still be found for weird proprietary embedded platforms.
C code:
int n;

void nplusplus() {
    n++;
}

void nplusone() {
    n = n + 1;
}
Output assembly (no optimizations):
.file "test.c"
.comm n,4,4
.text
.globl nplusplus
.type nplusplus, #function
nplusplus:
pushl %ebp
movl %esp, %ebp
movl n, %eax
addl $1, %eax
movl %eax, n
popl %ebp
ret
.size nplusplus, .-nplusplus
.globl nplusone
.type nplusone, #function
nplusone:
pushl %ebp
movl %esp, %ebp
movl n, %eax
addl $1, %eax
movl %eax, n
popl %ebp
ret
.size nplusone, .-nplusone
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
Output assembly (with -O2 optimizations):
.file "test.c"
.text
.p2align 4,,15
.globl nplusplus
.type nplusplus, #function
nplusplus:
pushl %ebp
movl %esp, %ebp
addl $1, n
popl %ebp
ret
.size nplusplus, .-nplusplus
.p2align 4,,15
.globl nplusone
.type nplusone, #function
nplusone:
pushl %ebp
movl %esp, %ebp
addl $1, n
popl %ebp
ret
.size nplusone, .-nplusone
.comm n,4,4
.ident "GCC: (Ubuntu 4.4.3-4ubuntu5) 4.4.3"
.section .note.GNU-stack,"",#progbits
The compiler will optimize n + 1 into nothingness.
Do you mean n = n + 1?
If so, they will compile to identical assembly. (Assuming that optimizations are on and that they're statements, not expressions)
Who says it does? Your compiler optimizes it all away, really, making it a moot point.
Modern compilers should be able to recognize both forms as equivalent and convert them to the format that works best on your target platform. There is one exception to this rule: variable accesses that have side effects. For example, if n is some memory-mapped hardware register, reading from it and writing to it may do more than just transferring a data value (reading might clear an interrupt, for instance). You would use the volatile keyword to let the compiler know that it needs to be careful about optimizing accesses to n, and in that case the compiler might generate different code from n++ (increment operation) and n = n + 1 (read, add, and store operations). However for normal variables, the compiler should optimize both forms to the same thing.
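As a minimal sketch of that memory-mapped-register case (the register name, address, and function below are hypothetical placeholders, not taken from any real device):

#include <stdint.h>

/* Hypothetical memory-mapped status register; the address is a placeholder. */
#define STATUS_REG (*(volatile uint32_t *)0x40000000u)

void bump_status(void)
{
    /* Because the object is volatile, the compiler must keep a real read and
       a real write for each statement; it cannot cache the value in a register
       or merge the accesses, so the code it emits for the two forms may
       legitimately differ on some targets. */
    STATUS_REG++;
    STATUS_REG = STATUS_REG + 1;
}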
It doesn't really. The compiler will make changes specific to the target architecture. Micro-optimizations like this often have dubious benefits, but importantly, are certainly not worth the programmer's time.
Actually, the reason is that the operator is defined differently for post-fix than it is for pre-fix. ++n will increment "n" and return a reference to "n", while n++ will increment "n" while returning a const copy of "n". Hence, the phrase n = n + 1 will be more efficient. But I have to agree with the above posters. Good compilers should optimize away an unused return object.
In the C language, the side effect of the n++ expression is by definition equivalent to the side effect of the n = n + 1 expression. Since your code relies on the side effects only, it is immediately obvious that the correct answer is that these expressions always have exactly equivalent performance. (Regardless of any optimization settings in the compiler, BTW, since the issue has absolutely nothing to do with any optimizations.)
Any practical divergence in performance of these expressions is only possible if the compiler is intentionally (and maliciously!) trying to introduce that divergence. But in this case it can go either way, of course, i.e. whichever way the compiler's author wanted to skew it.
I think it's more of a hardware question than a software one... If I remember correctly, in older CPUs n=n+1 required two memory locations, whereas ++n was simply a single microcontroller instruction... But I doubt this applies to modern architectures...
All of this depends on the compiler, the processor, and the compilation directives, so making assumptions about "what is faster in general" is not a good idea.
In the C language, what's the assembly for "b++"?
I got two situations:
1) one instruction
addl $0x1,-4(%rbp)
2) three instructions
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
Are these two situations caused by the compiler?
my code:
int main()
{
    int ret = 0;
    int i = 2;
    ret = i++;
    ret = ++i;
    return ret;
}
The .s file (++i uses the addl instruction, i++ uses the other sequence):
.file "main.c"
.text
.globl main
.type main, #function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, -8(%rbp) //ret
movl $2, -4(%rbp) //i
movl -4(%rbp), %eax
leal 1(%rax), %edx
movl %edx, -4(%rbp)
movl %eax, -8(%rbp)
addl $1, -4(%rbp)
movl -4(%rbp), %eax
movl %eax, -8(%rbp)
movl -8(%rbp), %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.3.1-14ubuntu2) 5.3.1 20160413"
.section .note.GNU-stack,"",#progbits
The ISO standard does not mandate at all what happens under the covers. It specifies a "virtual machine" that acts in a certain way given the C instructions you provide to it.
So, if your C compiler is implemented as a C-to-Dartmouth-Basic converter, b++ is just as likely to lead to 10 let b = b + 1 as anything else :-)
If you're compiling to common assembler code, then you're likely to see a difference depending on whether you use the result, specifically b++; as opposed to a = b++ since the result of the former can be safely thrown away.
You're also likely to see massive differences based on optimisation level.
Bottom line: short of specifying all the things that can affect the output (including but not limited to compiler, target platform, and optimisation levels), you can't really say what a given expression will compile to.
The first one is the output for ++i as part of ret = ++i. It doesn't need to keep the old value around, because it's doing ++i and then ret = i. Incrementing in memory and then reloading that is a really stupid and inefficient way to compile that, but you compiled with optimization disabled so gcc isn't even trying to make good asm output.
The 2nd one is the output for i++ as part of ret = i++. It needs to keep the old value of i around, so it loads into a register and uses lea to calculate i+1 in a different register. It could have just stored to ret and then incremented the register before storing back to i, but I guess with optimizations disabled gcc doesn't notice that.
Previous answer to the earlier, vaguer version of the question, which had no source and bogus code:
The asm for a tiny expression like b++ totally depends on the surrounding code in the rest of the function (or with optimization disabled, at least the rest of the statement) and whether it's a global or local, and whether it's declared volatile.
And of course compiler optimization options have a massive impact; with optimization disabled, gcc makes a separate block of asm for every C statement so you can use the GDB jump command to go to a different source line and have the code still produce the same behaviour you'd expect from the C abstract machine. Obviously this highly constrains code-gen: nothing is kept in registers across statements. This is good for source-level debugging, but sucks to read by hand because of all the noise of store/reload.
For the choice of inc vs. add, see INC instruction vs ADD 1: Does it matter? clang -O3 -mtune=bdver2 uses inc for memory-destination increments, but with generic tuning or any Intel P6 or Sandybridge-family CPU it uses add $1, (mem) for better micro-fusion.
See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at and making sense of compiler asm output.
The 2nd version in your original question looks like mostly un-optimized compiler output for this weird source:
// inside some function
int b;
// leaq -4(%rbp), %rax // rax = &b
b++; // incl (%rax)
b = (int)&b; // mov %eax, -4(%rbp)
(The question has since been edited to different code; looks like the original was mis-typed by hand mixing an opcode from once line with an operand from another line. I reproduce it here so all the comments about it being weird still make sense. For the updated code, see the first half of my answer: it depends on surrounding code and having optimization disabled. Using res = b++ needs the old value of b, not the incremented value, hence different asm.)
If that's not what your source does, then you must have left out some intervening instructions or something. Or else the compiler is re-using that stack slot for something else.
I'm curious what compiler you got that from, because gcc and clang typically don't like to use results they just computed. I'd have expected incl -4(%rbp).
Also that doesn't explain mov %eax, -4(%rbp). The compiler already used the address in %rax for inc, so why would a compiler revert to a 1-byte-longer RBP-relative addressing mode instead of mov %eax, (%rax)? Referencing fewer different registers that haven't been recently written is a good thing for Intel P6-family CPUs (up to Nehalem), to reduce register-read stalls. (Otherwise irrelevant.)
Using RBP as a frame pointer (and doing increments in memory instead of keeping simple variables in registers) looks like un-optimized code. But it can't be from gcc -O0, because it computes the address before the increment, and those have to be from two separate C statements.
b++ = &b; isn't valid because b++ isn't an lvalue. Well actually the comma operator lets you do b++, b = &b; in one statement, but gcc -O0 still evaluates it in order, rather than computing the address early.
Of course with optimization enabled, b would have to be volatile to explain incrementing in memory right before overwriting it.
clang is similar, but actually does compute that address early. For b++; b = &b;, notice that clang6.0 -O0 does an LEA and keeps RAX around across the increment. I guess clang's code-gen doesn't support consistent debugging with GDB's jump the way gcc does.
leaq -4(%rbp), %rax
movl -4(%rbp), %ecx
addl $1, %ecx
movl %ecx, -4(%rbp)
movl %eax, %ecx # copy the LEA result
movl %ecx, -4(%rbp)
I wasn't able to get gcc or clang to emit the sequence of instructions you show in the question with unoptimized or optimized + volatile, on the Godbolt compiler explorer. I didn't try ICC or MSVC, though. (Although unless that's disassembly, it can't be MSVC because it doesn't have an option to emit AT&T syntax.)
Any good compiler will optimise b++ to ++b if the result of the expression is discarded. You see this particularly in increments in for loops.
That's what is happening in your "one instruction" case.
It's not typically instructive to look at un-optimized compiler output, since values (variables) will usually be updated using a load-modify-store paradigm. This might be useful initially when getting to grips with assembly, but it's not the output to expect from an optimizing compiler that maintains values, pointers, etc., in registers for frequent use. (see: locality of reference)
/* un-optimized logic: */
int i = 2;
ret = i++; /* assign ret <- i, and post-increment i (ret = i; i++ (i = 3)) */
ret = ++i; /* pre-increment i, and assign ret <- i (++i (i = 4); ret = i) */
i.e., any modern, optimising compiler can easily determine that the final value of ret is (4).
Removing all the extraneous directives, etc., gcc-7.3.0 on OS X gives me:
_main: /* Darwin x86-64 ABI adds leading underscores to symbols... */
movl $4, %eax
ret
Apple's native clang, and the MacPorts clang-6.0, set up a basic stack frame but still optimise the ret arithmetic away:
_main:
pushq %rbp
movq %rsp, %rbp
movl $4, %eax
popq %rbp
retq
Note that the Mach-O (OS X) ABI is very similar to the ELF ABI for user-space code. Just try compiling with at least -O2 to get a feel for 'real' (production) code.
I have this simple piece of code in C:
#include <stdio.h>

void test() {}

int main()
{
    if (2 < 3) {
        int zz = 10;
    }
    return 0;
}
When I look at the assembly output of this code:
test():
pushq %rbp
movq %rsp, %rbp
nop
popq %rbp
ret
main:
pushq %rbp
movq %rsp, %rbp
movl $10, -4(%rbp) // the value 10 is stored into zz's stack slot
movl $0, %eax
popq %rbp
ret
I got the assembly from here (default options).
I can't see the instruction for the conditional check. Where is it?
You don't see it, because it isn't there. The compiler was able to perform analysis, and rather easily see that this branch will always be entered.
Instead of emitting a check that will do nothing but waste CPU cycles, it emits an easily optimized version of the code.
A C program is not a sequence of instructions for the CPU to perform. That's what the emitted machine code is. A C program is a description of the behavior your compiled program should have. A compiler is free to translate it in almost any way it wants, so long as you get that behavior.
It's known as "the as-if rule".
The interesting thing here is that gcc and clang optimize away the if() even at -O0, unlike some other compilers (ICC and MSVC).
gcc -O0 doesn't mean no optimization, it means no extra optimization beyond what's needed to compile at all. But gcc does have to transform through a couple internal representations of the function logic before emitting asm. (GIMPLE and Register Transfer Language). gcc doesn't have a special "dumb mode" where it slavishly transliterates every part of every C expression to asm.
Even a super-simple one-pass compiler like TCC does minor optimizations within an expression (or even a statement), like realizing that an always-true condition doesn't require branching.
gcc -O0 is the default, which you obviously used because the dead store to zz isn't optimized away.
gcc -O0 aims to compile quickly, and to give consistent debugging results.
Every C variable exists in memory, whether it's ever used or not.
Nothing is kept in registers across C statements (except variables declared register; -O0 is the only time that keyword does anything). So you can modify any C variable with a debugger while single-stepping, i.e. everything is spilled/reloaded between separate C statements. See also Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? (This is why benchmarking at -O0 is nonsense: writing the same code with fewer, larger expressions is faster only at -O0, not with real settings like -O3; see the sketch after these points.)
Other interesting consequences: constant-propagation doesn't work, see Why does integer division by -1 (negative one) result in FPE? for a case where gcc uses div for a variable set to a constant, vs. something simpler for a literal constant.
Every statement is compiled independently, so you can even jump to a different source line (within the same function) using GDB and get consistent results. (Unlike in optimized code where that would be likely to crash or give nonsense, and definitely not match the C abstract machine).
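As a minimal sketch of the point above about benchmarking at -O0 (the function names are made up for illustration), the two functions below are equivalent and compile identically with optimization enabled, but at -O0 the statement-by-statement version spills and reloads the running total around every statement, while the single-expression version keeps its intermediates in registers:

/* Equivalent functions: identical asm at -O2, but different at -O0. */
int split_sum(int a, int b, int c, int d) {
    int t = a + b;   /* stored to t's stack slot at -O0 */
    t += c;          /* reloaded, added, stored again */
    t += d;          /* and again */
    return t;
}

int one_expr(int a, int b, int c, int d) {
    return a + b + c + d;   /* one statement: intermediates stay in registers */
}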
Given all those requirements for gcc -O0 behaviour, if (2 < 3) can still be optimized to zero asm instructions. The behaviour doesn't depend on the value of any variable, and it's a single statement. There's no way it can ever be not-taken, so the simplest way to compile it is no instructions: fall-through into the { body } of the if.
Note that gcc -O0's rules / restrictions go far beyond the C as-if rule that the machine-code for a function merely has to implement all externally-visible behaviour of the C source. gcc -O3 optimizes the whole function down to just
main: # with optimization
xor eax, eax
ret
because it doesn't care about keeping asm for every C statement.
Other compilers:
See all 4 of the major x86 compilers on Godbolt.
clang is similar to gcc, but with a dead store of 0 to another spot on the stack, as well as the 10 for zz. clang -O0 is often closer to a transliteration of C into asm, for example it will use div for x / 2 instead of a shift, while gcc uses a multiplicative inverse for division by a constant even at -O0. But in this case, clang also decides that no instructions are sufficient for an always-true condition.
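For example, a trivial function like this (just a sketch to paste into the compiler explorer) is enough to see that difference in -O0 code generation:

/* Division by a constant: at -O0, clang tends to emit an actual division
   instruction for this, while gcc already substitutes the cheaper
   shift/multiply sequence it would use when optimizing. */
int half(int x) { return x / 2; }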
ICC and MSVC both emit asm for the branch, but instead of the mov $2, %ecx / cmp $3, %ecx you might expect, they both actually do 0 != 1 for no apparent reason:
# ICC18
pushq %rbp #6.1
movq %rsp, %rbp #6.1
subq $16, %rsp #6.1
movl $0, %eax #7.5
cmpl $1, %eax #7.5
je ..B1.3 # Prob 100% #7.5
movl $10, -16(%rbp) #9.16
..B1.3: # Preds ..B1.2 ..B1.1
movl $0, %eax #11.12
leave #11.12
ret #11.12
MSVC uses the xor-zeroing peephole optimization even without optimization enabled.
It's slightly interesting to look at which local / peephole optimizations compilers do even at -O0, but it doesn't tell you anything fundamental about C language rules or your code, it just tells you about compiler internals and the tradeoffs the compiler devs chose between spending time looking for simple optimizations vs. compiling even faster in no-optimization mode.
The asm is never intended to faithfully represent the C source in any kind of way that would let a decompiler reconstruct it. Just to implement equivalent logic.
It's simple. It is not there. The compiler optimized it away.
Here is the assembly when compiling with gcc without optimization:
.file "k.c"
.text
.globl test
.type test, #function
test:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
nop
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size test, .-test
.globl main
.type main, #function
main:
.LFB1:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $10, -4(%rbp)
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE1:
.size main, .-main
.ident "GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section .note.GNU-stack,"",#progbits
and here it is with optimization:
.file "k.c"
.text
.p2align 4,,15
.globl test
.type test, #function
test:
.LFB11:
.cfi_startproc
rep ret
.cfi_endproc
.LFE11:
.size test, .-test
.section .text.startup,"ax",#progbits
.p2align 4,,15
.globl main
.type main, #function
main:
.LFB12:
.cfi_startproc
xorl %eax, %eax
ret
.cfi_endproc
.LFE12:
.size main, .-main
.ident "GCC: (Debian 6.3.0-18) 6.3.0 20170516"
.section .note.GNU-stack,"",#progbits
As you can see, not only is the comparison optimized away; almost the whole of main is optimized away, since it does not produce anything visible. The variable zz is never used. The only observable thing your code does is return 0.
2 is always less than 3, so since the compiler knows the result of 2<3 is always true, there is no need for an if decision in the assembly.
Optimization here means generating code that is smaller and takes less time.
if (2<3)
is always true, therefore the compiler emits no opcode for it.
The condition if (2<3) is always true, so a decent compiler would detect this and generate the code as if the condition didn't exist. In fact, if you optimize it with -O3, godbolt.org generates just:
test():
rep ret
main:
xor eax, eax
ret
This is again valid because a compiler is allowed to optimise and transform the code as long as the observable behaviour is preserved.
For example: In the following code, how and where is the number '10' used for the comparison stored?
#include <stdio.h>
#include <conio.h>

int main()
{
    int x = 5;
    if (x > 10)
        printf("X is greater than 10");
    else if (x < 10)
        printf("X is lesser than 10");
    else
        printf("x = 10");
    getch();
    return 0;
}
Pardon me for not giving enough details. If, instead of initializing x directly with 5, we read it from the user, we know how memory is allocated for x. But how is memory allocated for the literal number 10, which is not stored in any variable?
In your particular code, x is initialized to 5 and is never changed. An optimizing compiler is able to constant fold and propagate that information. So it probably would generate the equivalent of
int main() {
    printf("X is lesser than 10");
    getch();
    return 0;
}
Notice that the compiler would also have done dead code elimination.
So both constants 5 and 10 would have disappeared.
BTW, <conio.h> and getch are not in standard C99 or C11. My Linux system doesn't have them.
In general (and depending upon the target processor's instruction set and the ABI) small constants are often embedded in some single machine code instruction (as an immediate operand), as Kilian answered. Some large constants (e.g. floating point numbers, literal strings, most const global or static arrays and aggregates) might get inserted and compiled as read only data in the code segment (then the constant inside machine register-load instructions would be an address or some offset relative to PC for PIC); see also this. Some architectures (e.g. SPARC, RISC-V, ARM, and other RISC) are able to load a wide constant in a register by two consecutive instructions (loading the constant in two parts), and this impacts the relocation format for the linker (e.g. in object files and executables, often in ELF).
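As a rough C-level sketch of that distinction (the function and variable names are placeholders; where each constant actually ends up is the compiler's and target's choice):

/* A small constant like 10 is normally encoded as an immediate operand
   right inside the compare instruction. */
int is_big(int x) { return x > 10; }

/* A string literal or other large/aggregate constant is typically emitted
   into a read-only data section (.rodata) and referenced by address. */
static const char msg[] = "X is lesser than 10";
const char *get_msg(void) { return msg; }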
I suggest asking your compiler to emit assembler code and having a glance at it. If using GCC (e.g. on Linux, or with Cygwin or MinGW), try compiling with gcc -Wall -O -fverbose-asm -S; on my Debian/Linux system, if I replace getch by getchar in your code, I get:
.section .rodata.str1.1,"aMS",#progbits,1
.LC0:
.string "X is lesser than 10"
.text
.globl main
.type main, #function
main:
.LFB11:
.cfi_startproc
subq $8, %rsp #,
.cfi_def_cfa_offset 16
movl $.LC0, %edi #,
movl $0, %eax #,
call printf #
movq stdin(%rip), %rdi # stdin,
call _IO_getc #
movl $0, %eax #,
addq $8, %rsp #,
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE11:
.size main, .-main
.ident "GCC: (Debian 4.9.2-10) 4.9.2"
.section .note.GNU-stack,"",#progbits
If you are using a 64-bit Windows system, your architecture is very likely x86-64. There is plenty of documentation describing the ISA (see answers to this) and the x86 calling conventions (and also the Linux x86-64 ABI; you'll find the equivalent document for Windows).
BTW, you should not really care how such constants are implemented. The semantics of your code should not change, whatever the compiler chooses to do to implement them. So leave the optimizations (and such low-level choices) to the compiler (i.e. your implementation of C).
The constant 10 is probably stored as an immediate constant in the opcode stream. Issuing a CMP AX,10, with the constant included in the opcode, is usually both smaller and faster than a CMP AX, [BX], where the comparison value must be loaded from memory.
If the constant is too large to fit into the opcode, the alternative is to store it in memory like a static variable, but if the instruction set allows embedded constants, a good compiler should use it - after all, that addressing mode was presumably added because it has advantages over the others.
It seems state-of-the-art compilers treat arguments passed on the stack as read-only. Note that in the x86 calling convention, the caller pushes arguments onto the stack and the callee uses the arguments from the stack. For example, the following C code:
extern int goo(int *x);

int foo(int x, int y) {
    goo(&x);
    return x;
}
is compiled by clang -O3 -c g.c -S -m32 in OS X 10.10 into:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## #foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
movl 8(%ebp), %eax
movl %eax, -4(%ebp)
leal -4(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl -4(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
Here, the parameter x (at 8(%ebp)) is first loaded into %eax and then stored to -4(%ebp); then the address -4(%ebp) is computed into %eax, and that address is passed to the function goo.
I wonder why Clang generates code that copies the value stored at 8(%ebp) to -4(%ebp), rather than just passing the address 8(%ebp) to the function goo. That would save memory operations and result in better performance. I observed similar behaviour in GCC too (under OS X). To be more specific, I wonder why compilers do not generate:
.section __TEXT,__text,regular,pure_instructions
.macosx_version_min 10, 10
.globl _foo
.align 4, 0x90
_foo: ## #foo
## BB#0:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
calll _goo
movl 8(%ebp), %eax
addl $8, %esp
popl %ebp
retl
.subsections_via_symbols
I searched for documentation on whether the x86 calling convention demands that the passed arguments be read-only, but I couldn't find anything on the issue. Does anybody have any thoughts on this?
The rules for C are that parameters must be passed by value. A compiler converts from one language (with one set of rules) to a different language (potentially with a completely different set of rules). The only limitation is that the behaviour remains the same. The rules of the C language do not apply to the target language (e.g. assembly).
What this means is that if a compiler feels like generating assembly language where parameters are passed by reference rather than by value, then this is perfectly legal (as long as the behaviour remains the same).
The real limitation has nothing to do with C at all. The real limitation is linking. So that different object files can be linked together, standards are needed to ensure that whatever the caller in one object file expects matches whatever the callee in another object file provides. This is what's known as the ABI. In some cases (e.g. 64-bit 80x86) there are multiple different ABIs for the exact same architecture.
You can even invent your own ABI that's radically different (and implement your own tools that support your own radically different ABI) and that's perfectly legal as far as the C standards go; even if your ABI requires "pass by reference" for everything (as long as the behaviour remains the same).
Actually, I just compiled this function using GCC:
int foo(int x)
{
    goo(&x);
    return x;
}
And it generated this code:
_foo:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
leal 8(%ebp), %eax
movl %eax, (%esp)
call _goo
movl 8(%ebp), %eax
leave
ret
This is using GCC 4.9.2 (on 32-bit cygwin if it matters), no optimizations. So in fact, GCC did exactly what you thought it should do and used the argument directly from where the caller pushed it on the stack.
The C programming language mandates that arguments are passed by value. So any modification of an argument (like an x++; as the first statement of your foo) is local to the function and does not propagate to the caller.
Hence, a general calling convention should require copying of arguments at every call site. Calling conventions should be general enough for unknown calls, e.g. thru a function pointer!
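As a minimal illustration of that pass-by-value rule (bump is just a made-up name for this sketch):

/* The increment inside bump() acts on a private copy of the argument,
   so the caller's n is unchanged afterwards. */
static void bump(int x) { x++; }

int main(void) {
    int n = 1;
    bump(n);
    return n;   /* still 1: the modification did not propagate back */
}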
Of course, if you pass an address to some memory zone, the called function is free to dereference that pointer, e.g. as in
int goo(int *x) {
    static int count;
    *x = count++;
    return count % 3;
}
BTW, you might use link-time optimizations (by compiling and linking with clang -flto -O2 or gcc -flto -O2) to perhaps enable the compiler to improve or inline some calls between translation units.
Notice that both Clang/LLVM and GCC are free software compilers. Feel free to propose an improvement patch to them if you want to (but since both are very complex pieces of software, you'll need to work some months to make that patch).
NB. When looking into produced assembly code, pass -fverbose-asm to your compiler!
My friend and I got a computer architecture project and we don't really know how to approach it. I hope you can at least point us in the right direction so we know what to look for. Since our professor isn't really good at explaining what we need to do and the subject is rather vague, we'll start from the beginning.
Our task is to somehow "edit" GCC so that it treats some operations differently. For example, when you add two char arguments in a .c program, it uses addb. We need to change it to use e.g. 16-bit registers (addl), without using extra parameters during compilation (just regular gcc p.c -o p). Why, or whether it will work, doesn't really matter at this point.
We'd like to know how we could change something inside GCC, and where we could even start looking, as I can't find any information about similar tasks besides making plugins/extensions. Is there anything we could read about this, or anything we could use?
In C, 'char' variables are normally added together as integers, so the C compiler will already use addl, except when it can see that using a smaller or faster form makes no difference to the result.
For example, this C code
unsigned char a, b, c;
int i;
void func1(void) { a = b + c; }
void func2(void) { i = b + c; }
gives this assembly with GCC:
.file "xq.c"
.text
.p2align 4,,15
.globl func1
.type func1, #function
func1:
movzbl c, %eax
addb b, %al
movb %al, a
ret
.size func1, .-func1
.p2align 4,,15
.globl func2
.type func2, #function
func2:
movzbl b, %edx
movzbl c, %eax
addl %edx, %eax
movl %eax, i
ret
.size func2, .-func2
.comm i,4,4
.comm c,1,4
.comm b,1,4
.comm a,1,4
.ident "GCC: (Debian 4.7.2-5) 4.7.2"
.section .note.GNU-stack,"",#progbits
Note that the first function uses addb but the second uses addl because the high bits of the result will be discarded in the first function when the result is stored.
This version of GCC is generating i686 code, so the integers are 32-bit (addl). Depending on exactly what you want, you may need to make the result a short, or actually get a compiler version that outputs 16-bit 8086 code.