GCC: Optimizing away memory loads and stores - c

EDIT 1: Added another example (showing that GCC is, in principle, capable of doing what I want to achieve) and some more discussion at the end of this question.
EDIT 2: Found the malloc function attribute, which should do what I want. Please take a look at the very end of the question.
This is a question about how to tell the compiler that stores to a memory area are not visible outside of a region (and thus could be optimized away). To illustrate what I mean, let's take a look at the following code
int f (int a)
{
  int v[2];
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
gcc -O2 generates the following assembly code (x86-64 gcc, trunk, on https://godbolt.org):
f:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
ret
.L4:
ret
As one can see, the loads and stores into the array v are gone after optimization.
Now consider the following code:
int g (int a, int *v)
{
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
The difference is that v is not (stack-) allocated in the function, but provided as an argument. The result of gcc -O2 in this case is:
g:
leal -1(%rdi), %edx
movl $0, 4(%rsi)
xorl %eax, %eax
movl %edx, (%rsi)
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
movl %eax, 4(%rsi)
movl $-1, (%rsi)
ret
.L4:
ret
Clearly, the code has to store the final values of v[0] and v[1] in memory as they may be observable.
Now, what I am looking for is a way to tell the compiler that the memory pointed to by v in the second example isn't accessible any more after the function g has returned so that the compiler could optimize away the memory accesses.
To have an even simpler example:
void h (int *v)
{
  v[0] = 0;
}
If the memory pointed to by v isn't accessible after h returns, it should be possible to simplify the function to a single ret.
I tried to achieve what I want by playing with the strict aliasing rules but haven't succeeded.
ADDED IN EDIT 1:
GCC seems to have the necessary code built-in as the following example shows:
#include <stdlib.h>

int h (int a)
{
  int *v = malloc (2 * sizeof (int));
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
The generated code contains no loads and stores:
h:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L4
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
ret
.L4:
ret
In other words, GCC knows that changing the memory area pointed to by v is not observable through any side-effect of malloc. For purposes like this one, GCC has __builtin_malloc.
So I can also ask: How can user code (say a user version of malloc) make use of this functionality?
ADDED IN EDIT 2:
GCC has the following function attribute:
malloc
This tells the compiler that a function is malloc-like, i.e., that the pointer P returned by the function cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P.
Using this attribute can improve optimization. The compiler predicts that a function with the attribute returns non-null in most cases. Functions like malloc and calloc have this property because they return a pointer to uninitialized or zeroed-out storage. However, functions like realloc do not have this property, as they can return a pointer to storage containing pointers.
It seems to do what I want as the following example shows:
__attribute__ (( malloc )) int *m (int *h);
int i (int a, int *h)
{
  int *v = m (h);
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
The generated assembler code has no loads and stores:
i:
pushq %rbx
movl %edi, %ebx
movq %rsi, %rdi
call m
testl %ebx, %ebx
jle .L4
leal -1(%rbx), %edx
xorl %eax, %eax
.L3:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L3
popq %rbx
ret
.L4:
xorl %eax, %eax
popq %rbx
ret
However, as soon as the compiler sees a definition of m, it may forget about the attribute. For example, this is the case when the following definition is given:
__attribute__ (( malloc )) int *m (int *h)
{
  return h;
}
In that case, the function is inlined and the compiler forgets about the attribute, yielding the same code as the function g.
P.S.: Initially, I thought that the restrict keyword might help, but it doesn't seem to.
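For reference, the restrict variant I tried is sketched below (the name g_restrict is illustrative). restrict only promises that v is not aliased by other pointers inside the function, not that the memory is unobservable after the function returns, so the final stores still have to be emitted.

```c
/* restrict variant of g(): the caller can still observe v[0] and
   v[1] after the call, so the final stores cannot be removed */
int g_restrict (int a, int *restrict v)
{
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
```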

EDIT: Discussion about the noinline attribute added at the end.
Using the following function definition, one can achieve the goal of my question:
__attribute__ (( malloc, noinline )) static void *get_restricted_ptr (void *p)
{
  return p;
}
This function get_restricted_ptr simply returns its pointer argument but informs the compiler that the returned pointer P cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P.
The use of this function is demonstrated here:
int i (int a, int *h)
{
  int *v = get_restricted_ptr (h);
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
The generated code does not contain loads and stores:
i:
leal -1(%rdi), %edx
xorl %eax, %eax
testl %edi, %edi
jle .L6
.L5:
addl %edx, %eax
subl $1, %edx
cmpl $-1, %edx
jne .L5
ret
.L6:
ret
ADDED IN EDIT: If the noinline attribute is left out, GCC ignores the malloc attribute. Apparently, in this case, the function gets inlined first so that there is no function call any more for which GCC would check the malloc attribute. (One can discuss whether this behaviour should be considered a bug in GCC.) With the noinline attribute, the function doesn't get inlined. Then, due to the malloc attribute, GCC understands that the call to that function is unnecessary and removes it completely.
Unfortunately, this means that the (trivial) function won't be inlined when its call is not eliminated due to the malloc attribute.

Both functions have side effects, so their memory reads and stores cannot be optimized out:
void h (int *v)
{
  v[0] = 0;
}
and
int g (int a, int *v)
{
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}
The side effects have to be observable outside the function's scope. Inlined functions can behave differently, because their side effects then only have to be observable outside the enclosing code.
inline int g (int a, int *v)
{
  v[0] = a;
  v[1] = 0;
  while (v[0]-- > 0)
    v[1] += v[0];
  return v[1];
}

void h(void)
{
  int x[2], y;
  g(y, x);
}
This code will be optimized to just a simple return. You can make promises to the compiler that enable easier optimizations by using the restrict keyword. But of course your code must keep those promises.
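To illustrate what restrict actually promises, here is a minimal sketch (names are illustrative): because the three pointers are declared restrict, the compiler may assume they never alias, so it can keep *c in a register instead of reloading it after the store through *a.

```c
/* with restrict the compiler may assume a, b and c never alias,
   so *c need not be reloaded after the store through *a */
void update (int *restrict a, int *restrict b, int *restrict c)
{
  *a += *c;
  *b += *c;
}
```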

For C, the only requirement is that the compiler has to ensure the code's observable behavior stays the same. If the compiler can prove the behavior is unchanged, it can and will remove the stores.
For example, I put this into https://godbolt.org/ :
void h (int *v)
{
  v[0] = 0;
}

void foo() {
  int v[2] = {1, 2};
  h(v);
}
And told it to use GCC 8.2 and "-O3", and got this output:
h(int*):
mov DWORD PTR [rdi], 0
ret
foo():
ret
Note that there are two different versions of the function h() in the output. The first version exists in case other code (in other object files) wants to use the function (and may be discarded by the linker). The second version of h() was inlined directly into foo() and then optimised down to absolutely nothing.
If you change the code to this:
static void h (int *v)
{
  v[0] = 0;
}

void foo() {
  int v[2] = {1, 2};
  h(v);
}
Then it tells the compiler that the version of h() that only existed for linking with other object files isn't needed, so the compiler only generates the second version of h() and the output becomes this:
foo():
ret
Of course no optimizer in any compiler is perfect - for more complex code (and for different compilers, including different versions of GCC) the results might differ (the compiler may fail to do this optimization). This is purely a limitation of the compiler's optimizer and not a limitation of C itself.
For cases where the compiler's optimiser isn't good enough, there are 4 possible solutions:
get a better compiler
improve the compiler's optimiser (e.g. send an email to the compiler's developers that includes a minimal example, and cross your fingers)
modify the code to make it easier for the compiler's optimiser (e.g. copy the input array into a local array, like void h(int *v) { int temp[2]; temp[0] = v[0]; temp[1] = v[1]; ... })
shrug and say "Oh, that's a pity" and do nothing
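Applied to the g() function from the question, the third option might look like the following sketch (with no guarantee that every GCC version takes the hint): the loop can then run entirely on the local copy, and only the final write-back is an observable store.

```c
/* work on a local copy so the loop can stay in registers;
   only the final write-back has to touch memory */
int g_local (int a, int *v)
{
  int t[2];
  t[0] = a;
  t[1] = 0;
  while (t[0]-- > 0)
    t[1] += t[0];
  v[0] = t[0];
  v[1] = t[1];
  return t[1];
}
```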


__attribute__((malloc)) vs restrict

Why does gcc need __attribute__((__malloc__))? Shouldn't the same info be communicable by declaring malloc (and similar functions) as returning restricted pointers (void *restrict malloc(size_t))?
It seems to me that that approach would be better: apart from not requiring a nonstandard feature, it would also allow one to apply it to functions "returning" via a pointer (int malloc_by_arg(void *restrict *retval, size_t size);).
Even though they are quite similar, the same function yields different optimizations when restrict or __attribute__((malloc)) is added. Consider this example (included here as a reference for a good use of __attribute__((malloc))):
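The malloc_by_arg idea from the comment above could be sketched like this (illustrative only; the standard attaches no allocator semantics to a restrict-qualified out-parameter, which is part of what the comment is asking about):

```c
#include <stdlib.h>

/* hypothetical allocator that "returns" via a restrict out-parameter */
int malloc_by_arg (void *restrict *retval, size_t size)
{
  void *p = malloc (size);
  if (!p)
    return -1;
  *retval = p;
  return 0;
}
```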
#include <stdlib.h>
#include <stdio.h>
int a;
__attribute__ ((__malloc__)) void* my_malloc(int size)
{
  void* p = malloc(size);
  if (!p) {
    printf("my_malloc: out of memory!\n");
    exit(1);
  }
  return p;
}
int main() {
  int* x = &a;
  int* p = (int*) my_malloc(sizeof(int));
  *x = 0;
  *p = 1;
  if (*x)
    printf("This printf statement to be detected as unreachable "
           "and discarded during compilation process\n");
  return 0;
}
And this one (Same code without attributes):
#include <stdlib.h>
#include <stdio.h>

void* my_malloc(int size);

int a;

void* my_malloc(int size)
{
  void* p = malloc(size);
  if (!p) {
    printf("my_malloc: out of memory!\n");
    exit(1);
  }
  return p;
}

int main() {
  int* x = &a;
  int* p = (int*) my_malloc(sizeof(int));
  *x = 0;
  *p = 1;
  if (*x)
    printf("This printf statement to be detected as unreachable "
           "and discarded during compilation process\n");
  return 0;
}
As we could expect, the code with the malloc attribute is better optimized (both with -O3) than the code without it. Let me include just the differences:
without attribute:
[...]
call ___main
movl $4, (%esp)
call _malloc
testl %eax, %eax
je L9
movl $0, _a
xorl %eax, %eax
leave
.cfi_remember_state
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
L9:
.cfi_restore_state
movl $LC0, (%esp)
call _puts
movl $1, (%esp)
call _exit
.cfi_endproc
[...]
with attribute:
[...]
call ___main
movl $4, (%esp)
call _my_malloc
movl $0, _a
xorl %eax, %eax
leave
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
.cfi_endproc
[...]
Nonetheless, the use of restrict in this case is worthless, given that it doesn't improve the generated code. If we modify the original code to use restrict:
#include <stdlib.h>
#include <stdio.h>

void *restrict my_malloc(int size);

int a;

void *restrict my_malloc(int size)
{
  void *restrict p = malloc(size);
  if (!p) {
    printf("my_malloc: out of memory!\n");
    exit(1);
  }
  return p;
}

int main() {
  int* x = &a;
  int* p = (int*) my_malloc(sizeof(int));
  *x = 0;
  *p = 1;
  if (*x)
    printf("This printf statement to be detected as unreachable "
           "and discarded during compilation process\n");
  return 0;
}
The asm code is exactly the same as that generated without the malloc attribute:
[...]
call ___main
movl $4, (%esp)
call _malloc
testl %eax, %eax
je L9
movl $0, _a
xorl %eax, %eax
leave
.cfi_remember_state
.cfi_restore 5
.cfi_def_cfa 4, 4
ret
L9:
.cfi_restore_state
movl $LC0, (%esp)
call _puts
movl $1, (%esp)
call _exit
.cfi_endproc
[...]
So for malloc/calloc-like functions, the use of __attribute__((__malloc__)) looks more useful than restrict.
__attribute__((__malloc__)) and restrict lead to different optimization behaviours, even though their definitions are quite similar. That makes me think there is no point in "merging" them, given that the compiler achieves different optimizations through different means. Even when both are used at the same time, the generated code won't be more optimized than the most optimized code with just one of them (__attribute__((__malloc__)) or restrict, depending on the case). So it is the programmer's choice to know which one fits better according to his/her code.
Why is __attribute__((__malloc__)) not standard? I don't know, but IMO these similarities from the definition point of view, and differences from the behaviour point of view, don't make it easy to integrate both into the standard in a clear, well-differentiated and general way.
In my test, even for the function without the attribute, GCC can still optimize the code, compiled with the command: ~/gcc11.1.0-install/bin/aarch64-linux-gnu-gcc test2.c -O3 -S
main:
.LFB23:
.cfi_startproc
stp x29, x30, [sp, -16]!
.cfi_def_cfa_offset 16
.cfi_offset 29, -16
.cfi_offset 30, -8
mov w0, 4
mov x29, sp
bl my_malloc
adrp x1, .LANCHOR0
mov w0, 0
ldp x29, x30, [sp], 16
.cfi_restore 30
.cfi_restore 29
.cfi_def_cfa_offset 0
str wzr, [x1, #:lo12:.LANCHOR0]
ret
.cfi_endproc
.LFE23:
.size main, .-main

Faster technique for integer equality in C

I am writing a function which checks if two integers are the same. I wrote it in two different manners, and I want to know if there is any performance difference.
Technique 1
int checkEqual(int a, int b)
{
  if (a == b)
    return 1; // it means they were equal
  else
    return 0;
}
Technique 2
int checkEqual(int a, int b)
{
  if (!(a - b))
    return 1; // it means they are equal
  else
    return 0;
}
In short, there is no performance difference.
I compiled each technique using gcc-4.8.2 with the -O2 -S options (-S generates assembly code).
Technique 1
checkEqual1:
.LFB24:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
Technique 2
checkEqual2:
.LFB25:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
These are exactly the same assembly code.
So these two codes will provide the same performance.
Appendix
bool checkEquals3(int a, int b) { return a == b; }
provides
checkEqual3:
.LFB26:
.cfi_startproc
xorl %eax, %eax
cmpl %esi, %edi
sete %al
ret
exactly the same assembly code too!
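One caveat worth adding, even though the generated code is identical here: technique 2 relies on signed subtraction, which can overflow, and signed overflow is undefined behavior in C. So technique 1 is also the only fully portable one.

```c
#include <limits.h>

int checkEqual1 (int a, int b) { return a == b; }   /* always well-defined */
int checkEqual2 (int a, int b) { return !(a - b); } /* undefined behavior when
                                                       a - b overflows, e.g.
                                                       a = INT_MAX, b = -1 */
```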
It doesn't make any sense whatsoever to discuss manual code optimization without a specific system in mind.
That being said, you should always leave optimizations like these to the compiler and focus on writing as readable code as possible.
Your code can be made more readable by using only one return statement. Also, indent your code.
int checkEqual (int a, int b)
{
  return a == b;
}

GCC optimization missed opportunity

I'm compiling this C code:
int mode; // use aa if true, else bb
int aa[2];
int bb[2];
inline int auto0() { return mode ? aa[0] : bb[0]; }
inline int auto1() { return mode ? aa[1] : bb[1]; }
int slow() { return auto1() - auto0(); }
int fast() { return mode ? aa[1] - aa[0] : bb[1] - bb[0]; }
Both slow() and fast() functions are meant to do the same thing, though fast() does it with one branch statement instead of two. I wanted to check if GCC would collapse the two branches into one. I've tried this with GCC 4.4 and 4.7, with various levels of optimization such as -O2, -O3, -Os, and -Ofast. It always gives the same strange results:
slow():
movl mode(%rip), %ecx
testl %ecx, %ecx
je .L10
movl aa+4(%rip), %eax
movl aa(%rip), %edx
subl %edx, %eax
ret
.L10:
movl bb+4(%rip), %eax
movl bb(%rip), %edx
subl %edx, %eax
ret
fast():
movl mode(%rip), %esi
testl %esi, %esi
jne .L18
movl bb+4(%rip), %eax
subl bb(%rip), %eax
ret
.L18:
movl aa+4(%rip), %eax
subl aa(%rip), %eax
ret
Indeed, only one branch is generated in each function. However, slow() seems to be inferior in a surprising way: it uses one extra load in each branch, for aa[0] and bb[0]. The fast() code uses them straight from memory in the subls without loading them into a register first. So slow() uses one extra register and one extra instruction per call.
A simple micro-benchmark shows that calling fast() one billion times takes 0.7 seconds, vs. 1.1 seconds for slow(). I'm using a Xeon E5-2690 at 2.9 GHz.
Why should this be? Can you tweak my source code somehow so that GCC does a better job?
Edit: here are the results with clang 4.2 on Mac OS:
slow():
movq _aa@GOTPCREL(%rip), %rax ; rax = aa (both ints at once)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = bb
movq _mode@GOTPCREL(%rip), %rdx ; rdx = mode
cmpl $0, (%rdx) ; mode == 0 ?
leaq 4(%rcx), %rdx ; rdx = bb[1]
cmovneq %rax, %rcx ; if (mode != 0) rcx = aa
leaq 4(%rax), %rax ; rax = aa[1]
cmoveq %rdx, %rax ; if (mode == 0) rax = bb
movl (%rax), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
fast():
movq _mode@GOTPCREL(%rip), %rax ; rax = mode
cmpl $0, (%rax) ; mode == 0 ?
je LBB1_2 ; if (mode != 0) {
movq _aa@GOTPCREL(%rip), %rcx ; rcx = aa
jmp LBB1_3 ; } else {
LBB1_2: ; // (mode == 0)
movq _bb@GOTPCREL(%rip), %rcx ; rcx = bb
LBB1_3: ; }
movl 4(%rcx), %eax ; eax = xx[1]
subl (%rcx), %eax ; eax -= xx[0]
Interesting: clang generates branchless conditionals for slow() but one branch for fast()! On the other hand, slow() does three loads (two of which are speculative, one will be unnecessary) vs. two for fast(). The fast() implementation is more "obvious," and as with GCC it's shorter and uses one less register.
GCC 4.7 on Mac OS generally suffers the same issue as on Linux. Yet it uses the same "load 8 bytes then twice extract 4 bytes" pattern as Clang on Mac OS. That's sort of interesting, but not very relevant, as the original issue of emitting subl with two registers rather than one memory and one register is the same on either platform for GCC.
The reason is that in the initial intermediate code, emitted for slow(), the memory load and the subtraction are in different basic blocks:
slow ()
{
int D.1405;
int mode.3;
int D.1402;
int D.1379;
# BLOCK 2 freq:10000
mode.3_5 = mode;
if (mode.3_5 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:5000
D.1402_6 = aa[1];
D.1405_10 = aa[0];
goto <bb 5>;
# BLOCK 4 freq:5000
D.1402_7 = bb[1];
D.1405_11 = bb[0];
# BLOCK 5 freq:10000
D.1379_3 = D.1402_17 - D.1405_12;
return D.1379_3;
}
whereas in fast() they are in the same basic block:
fast ()
{
int D.1377;
int D.1376;
int D.1374;
int D.1373;
int mode.1;
int D.1368;
# BLOCK 2 freq:10000
mode.1_2 = mode;
if (mode.1_2 != 0)
goto <bb 3>;
else
goto <bb 4>;
# BLOCK 3 freq:3900
D.1373_3 = aa[1];
D.1374_4 = aa[0];
D.1368_5 = D.1373_3 - D.1374_4;
goto <bb 5>;
# BLOCK 4 freq:6100
D.1376_6 = bb[1];
D.1377_7 = bb[0];
D.1368_8 = D.1376_6 - D.1377_7;
# BLOCK 5 freq:10000
return D.1368_1;
}
GCC relies on instruction combining pass to handle cases like this (i.e. apparently not on the peephole optimization pass) and combining works on the scope of a basic block. That's why the subtraction and load are combined in a single insn in fast() and they aren't even considered for combining in slow().
Later, in the basic block reordering pass, the subtraction in slow() is duplicated and moved into the basic blocks, which contain the loads. Now there's opportunity for the combiner to, well, combine the load and the subtraction, but unfortunately, the combiner pass is not run again (and perhaps it cannot be run that late in the compilation process with hard registers already allocated and stuff).
I don't have an answer as to why GCC is unable to optimize the code the way you want it to, but I have a way to re-organize your code to achieve similar performance. Instead of organizing your code the way you have done so in slow() or fast(), I would recommend that you define an inline function that returns either aa or bb based on mode without needing a branch:
inline int * xx () { static int *xx[] = { bb, aa }; return xx[!!mode]; }
inline int kwiky(int *xx) { return xx[1] - xx[0]; }
int kwik() { return kwiky(xx()); }
When compiled by GCC 4.7 with -O3:
movl mode, %edx
xorl %eax, %eax
testl %edx, %edx
setne %al
movl xx.1369(,%eax,4), %edx
movl 4(%edx), %eax
subl (%edx), %eax
ret
With the definition of xx(), you can redefine auto0() and auto1() like so:
inline int auto0() { return xx()[0]; }
inline int auto1() { return xx()[1]; }
And, from this, you should see that slow() now compiles into code similar or identical to kwik().
Have you tried modifying the compiler's internal parameters (--param name=value in the man page)? Those are not changed by any optimization level (with three minor exceptions).
Some of them control code reduction/deduplication.
For some optimizations in this section you can read things like « larger values can exponentially increase compilation time » .

How does an ELF pad its `short`s?

I know that function arguments are padded to target word size, but with what?
Specifically in the context of the x86 Linux GNU toolchain, what do these functions return?
int iMysteryMeat(short x)
{
  return *((int *)&x);
}

unsigned uMysteryMeat(unsigned short x)
{
  return *((unsigned *)&x);
}
The question is whether, when hand-coding a function in assembly, it is necessary to sterilize "small" arguments by masking or sign-extending them before using them in "large" contexts (andl, imull).
I'd also be interested whether there are any more general or cross-platform standards for this case.
This depends on the ABI. The ABI needs to specify the choice that small arguments are extended by the caller, or by the callee (and how). Unfortunately, this part of the ABI is often underspecified, leading to different choices by different compilers. So to prevent incompatibility between code compiled with different legacy compilers, most modern compilers (I know in particular about gcc on i386) err on the side of caution and do both.
int a(short x) {
  return x;
}

int b(int x);

int c(short x) {
  return b(x);
}
gcc -m32 -O3 -S tmp.c -o tmp.s
_a:
pushl %ebp
movl %esp, %ebp
movswl 8(%ebp),%eax
leave
ret
_c:
pushl %ebp
movl %esp, %ebp
movswl 8(%ebp),%eax
movl %eax, 8(%ebp)
leave
jmp _b
Note that a does not assume any extension rule about its argument, but extends it itself. Similarly, c makes sure to extend its argument before passing it to b (via a tail call).
int iMysteryMeat(short x)
{
  return *((int *)&x);
}
This is undefined behavior in C, this violates aliasing rules and may also violate alignment requirements. In short don't do this.
Although Keith's answer is in line with the spirit of my question, per Alex's request I thought I'd try this out for myself.
Interestingly, in this case the more literal answer to my example is "garbage".
#include <stdio.h>

int iMysteryMeat(short x)
{
  return *((int *)&x);
}

unsigned uMysteryMeat(unsigned short x)
{
  return *((unsigned *)&x);
}

int main()
{
  printf("iMeat: 0x%08x\n", iMysteryMeat(-23));
  printf("uMeat: 0x%08x\n", uMysteryMeat(-23));
  return 0;
}
gcc -m32 -S meat.c
iMysteryMeat:
pushl %ebp
movl %esp, %ebp
subl $4, %esp
movl 8(%ebp), %eax
movw %ax, -4(%ebp)
leal -4(%ebp), %eax
movl (%eax), %eax
leave
ret
uMysteryMeat:
pushl %ebp
movl %esp, %ebp
subl $4, %esp
movl 8(%ebp), %eax
movw %ax, -4(%ebp)
leal -4(%ebp), %eax
movl (%eax), %eax
leave
ret
./a.out
iMeat: 0x0804ffe9
uMeat: 0x0043ffe9
As you can see, not only is the usual sign-extension protocol overridden (i.e. compare with Keith's a()), it actually moves x into uninitialized stack space with movw, rendering the top half of the return value garbage no matter what main() gives it.
So, again, as ouah said, never ever do this in C, and in assembly (or in general, really), always sterilize your inputs.
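For contrast, the well-defined way to widen a small argument in C is a plain value conversion; the compiler then emits the appropriate movswl/movzwl itself (function names here are illustrative):

```c
/* well-defined widening: the implicit conversion sign-extends
   the signed short and zero-extends the unsigned one */
int iSafe (short x)
{
  return x;
}

unsigned uSafe (unsigned short x)
{
  return x;
}
```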

Post-increment, function calls, sequence point concept in GCC

There is a code fragment that GCC produce the result I didn't expect:
(I am using gcc version 4.6.1 Ubuntu/Linaro 4.6.1-9ubuntu3 for target i686-linux-gnu)
[test.c]
#include <stdio.h>

int *ptr;

int f(void)
{
  (*ptr)++;
  return 1;
}

int main()
{
  int a = 1, b = 2;
  ptr = &b;
  a = b++ + f() + f() ? b : a;
  printf("b = %d\n", b);
  return a;
}
In my understanding, there is a sequence point at each function call.
The post-increment should take place before f().
see C99 5.1.2.3:
"... called sequence points, all side effects of previous evaluations
shall be complete and no side effects of subsequent evaluations shall
have taken place."
For this test case, perhaps the order of evaluation is unspecified,
but the final result should be the same. So I expect b's final value to be 5.
However, after compiling this case with 'gcc test.c -std=c99', the output shows b = 3.
Then I use "gcc test.c -std=c99 -S" to see what happened:
movl $1, 28(%esp)
movl $2, 24(%esp)
leal 24(%esp), %eax
movl %eax, ptr
movl 24(%esp), %ebx
call f
leal (%ebx,%eax), %esi
call f
addl %esi, %eax
testl %eax, %eax
setne %al
leal 1(%ebx), %edx
movl %edx, 24(%esp)
testb %al, %al
je .L3
movl 24(%esp), %eax
jmp .L4
.L3:
movl 28(%esp), %eax
.L4:
movl %eax, 28(%esp)
It seems that GCC uses the value evaluated before f() and performs the '++' operation after the two f() calls.
I also used llvm-clang to compile this case, and the result shows b = 5, which is what I expect.
Is my understanding of post-increment and sequence point behavior incorrect?
Or is this a known issue of GCC 4.6.1?
In addition to Clang, there are two other tools that you can use as reference: Frama-C's value analysis and KCC. I won't go into the details of how to install them or use them for this purpose, but they can be used to check the definedness of a C program—unlike a compiler, they are designed to tell you if the target program exhibits undefined behavior.
They have their rough edges, but they both think that b should definitely be 5 with no undefined behavior at the end of your program:
Mini:~/c-semantics $ dist/kcc ~/t.c
Mini:~/c-semantics $ ./a.out
b = 5
This is an even stronger argument than Clang thinking so (since if it were undefined behavior, Clang could still generate a program that prints b = 5).
Long story short, it looks like you have found a bug in that version of GCC. The next step is to check out the SVN to see if it's still present there.
I reported this GCC bug some time ago and it was fixed earlier this year. See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48814
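Independently of the bug, expressions like this are easier for both humans and compilers to get right when each side effect sits in its own full expression, since every full expression ends with a sequence point. A sketch of an unambiguous rewrite of the test case (compute is an illustrative name):

```c
int *ptr;

static int f (void)
{
  (*ptr)++;
  return 1;
}

int compute (void)
{
  int a = 1, b = 2;
  int t;
  ptr = &b;
  t = b++;      /* sequence point: b is 3 afterwards */
  t += f();     /* b is 4 afterwards */
  t += f();     /* b is 5 afterwards */
  a = t ? b : a;
  return a;     /* 5 */
}
```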
