Post-increment, function calls, sequence point concept in GCC - c

Here is a code fragment for which GCC produces a result I didn't expect:
(I am using gcc version 4.6.1, Ubuntu/Linaro 4.6.1-9ubuntu3, targeting i686-linux-gnu.)
[test.c]
#include <stdio.h>

int *ptr;

int f(void)
{
    (*ptr)++;
    return 1;
}

int main()
{
    int a = 1, b = 2;
    ptr = &b;
    a = b++ + f() + f() ? b : a;
    printf("b = %d\n", b);
    return a;
}
In my understanding, there is a sequence point at a function call.
The post-increment should take place before f().
see C99 5.1.2.3:
"... called sequence points, all side effects of previous evaluations
shall be complete and no side effects of subsequent evaluations shall
have taken place."
For this test case the order of evaluation is perhaps unspecified,
but the final result should be the same either way. So I expect b's final value to be 5.
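To spell out the reasoning (my own walk-through, not part of the original post): whichever position the b++ side effect takes relative to the two calls, b ends up as 5.

/* Each f() call does (*ptr)++, i.e. b++, and returns 1.
   The sequence points allow three orderings of the b++ side effect:

   b++ first:   b++: 2 -> 3,  f(): 3 -> 4,  f(): 4 -> 5   (b++ yields 2)
   b++ between: f(): 2 -> 3,  b++: 3 -> 4,  f(): 4 -> 5   (b++ yields 3)
   b++ last:    f(): 2 -> 3,  f(): 3 -> 4,  b++: 4 -> 5   (b++ yields 4)

   The condition is nonzero in every case, so a = b, and b == 5 at the printf. */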
However, after compiling this case with 'gcc test.c -std=c99', the output shows b = 3.
Then I used "gcc test.c -std=c99 -S" to see what happened:
movl $1, 28(%esp)
movl $2, 24(%esp)
leal 24(%esp), %eax
movl %eax, ptr
movl 24(%esp), %ebx
call f
leal (%ebx,%eax), %esi
call f
addl %esi, %eax
testl %eax, %eax
setne %al
leal 1(%ebx), %edx
movl %edx, 24(%esp)
testb %al, %al
je .L3
movl 24(%esp), %eax
jmp .L4
.L3:
movl 28(%esp), %eax
.L4:
movl %eax, 28(%esp)
It seems that GCC uses the value of b evaluated before the calls to f() and performs
the '++' operation after both f() calls.
I also compiled this case with Clang (LLVM),
and the result is b = 5, which is what I expect.
Is my understanding of post-increment and sequence-point behavior incorrect?
Or is this a known issue in GCC 4.6.1?

In addition to Clang, there are two other tools that you can use as reference: Frama-C's value analysis and KCC. I won't go into the details of how to install them or use them for this purpose, but they can be used to check the definedness of a C program—unlike a compiler, they are designed to tell you if the target program exhibits undefined behavior.
They have their rough edges, but they both think that b should definitely be 5 with no undefined behavior at the end of your program:
Mini:~/c-semantics $ dist/kcc ~/t.c
Mini:~/c-semantics $ ./a.out
b = 5
This is an even stronger argument than Clang thinking so (since if it were undefined behavior, Clang could still generate a program that prints b = 5).
Long story short, it looks like you have found a bug in that version of GCC. The next step is to check out the SVN to see if it's still present there.

I reported this GCC bug some time ago and it was fixed earlier this year. See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48814

Related

Is the conditional move optimization against the C standard?

It is a common optimization to use conditional move (assembly cmov) to optimize the conditional expression ?: in C. However, the C standard says:
The first operand is evaluated; there is a sequence point between its evaluation and the
evaluation of the second or third operand (whichever is evaluated). The second operand
is evaluated only if the first compares unequal to 0; the third operand is evaluated only if
the first compares equal to 0; the result is the value of the second or third operand
(whichever is evaluated), converted to the type described below.110)
For example, the following C code
#include <stdio.h>

int main() {
    int a, b;
    scanf("%d %d", &a, &b);
    int c = a > b ? a + 1 : 2 + b;
    printf("%d", c);
    return 0;
}
generates the following optimized assembly:
call __isoc99_scanf
movl (%rsp), %esi
movl 4(%rsp), %ecx
movl $1, %edi
leal 2(%rcx), %eax
leal 1(%rsi), %edx
cmpl %ecx, %esi
movl $.LC1, %esi
cmovle %eax, %edx
xorl %eax, %eax
call __printf_chk
According to the standard, only one branch of the conditional expression is evaluated. But here both branches are evaluated, which goes against the standard's semantics. Is this optimization against the C standard? Or do many compiler optimizations have inconsistencies with the language standard?
The optimization is legal, due to the "as-if rule", i.e. C11 5.1.2.3p6.
A conforming implementation is just required to produce a program that, when run, produces the same observable behaviour as executing the program under the abstract semantics would have produced. The rest of the standard just describes those abstract semantics.
What the compiled program does internally does not matter at all; the only thing that matters is that when the program ends it has no observable behaviour other than reading a and b and printing the value of a + 1 or b + 2, depending on which of a or b is greater, unless something occurs that causes the behaviour to be undefined. (Bad input leaves a and b uninitialized, so accessing them is undefined; a range error or signed overflow could occur too.) If undefined behaviour occurs, then all bets are off.
Since accesses to volatile variables must be evaluated strictly according to the abstract semantics, you can get rid of the conditional move by using volatile here:
#include <stdio.h>

int main() {
    volatile int a, b;
    scanf("%d %d", &a, &b);
    int c = a > b ? a + 1 : 2 + b;
    printf("%d", c);
    return 0;
}
compiles to
call __isoc99_scanf@PLT
movl (%rsp), %edx
movl 4(%rsp), %eax
cmpl %eax, %edx
jg .L7
movl 4(%rsp), %edx
addl $2, %edx
.L3:
leaq .LC1(%rip), %rsi
xorl %eax, %eax
movl $1, %edi
call __printf_chk@PLT
[...]
.L7:
.cfi_restore_state
movl (%rsp), %edx
addl $1, %edx
jmp .L3
with my GCC (Ubuntu 7.2.0-8ubuntu3.2).
The C Standard describes an abstract machine executing C code. A compiler is free to perform any optimization as long as that abstraction is not violated, i.e. a conforming program cannot tell the difference.
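The as-if rule also cuts the other way: both arms may be evaluated only when doing so is unobservable. Here is a minimal sketch of my own (not from the question) where an unconditional load plus cmov would be invalid:

#include <stdio.h>

/* The abstract machine never dereferences p when it is null, so a compiler
   that loaded *p unconditionally and selected the result with cmov could
   fault where the abstract semantics do not. */
int safe_deref(int *p) {
    return p ? *p : 0;
}

int main(void) {
    int v = 42;
    printf("%d %d\n", safe_deref(&v), safe_deref(NULL)); /* prints: 42 0 */
    return 0;
}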

Preprocessor Definitions VS Local Variables, Speed Difference

I just compiled the following C code to test out the gcc optimizer (using the -O3 flag), expecting that both functions would end up generating the same set of assembly instructions:
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
    return x*x+x*y+y;
#undef x
#undef y
}

int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}
But I was surprised to find that they generated slightly different assembly, and that the execution time for test1 (the code using the preprocessor instead of local variables) was a bit faster.
I've heard people say that the compiler can optimize better than humans can, and that you should tell it exactly what you want it to do; man I guess they weren't kidding. I thought the compiler was supposed to kind of guess at the programmer's intended use of local variables and replace their use if necessary... is that a false assumption?
When writing code for performance, are you better off using preprocessor definitions for the sake of readability rather than local variables? I know it looks ugly as hell, but apparently it actually makes a difference, unless I'm missing something.
Here's the assembly I got, using "gcc test.c -O3 -S". My gcc version is 4.8.2; the assembly output looks the same for most versions of gcc, but differs on the 4.7 and 4.8 series for some reason.
test1
movl %edi, %eax
movl %edi, %edx
leal (%rsi,%rsi,2), %ecx
imull %edi, %eax
imull %esi, %edx
imull %edi, %eax
imull %edi, %edx
addl %esi, %eax
addl %ecx, %edx
leal (%rax,%rdx), %ecx
imull %ecx, %eax
addl %edx, %eax
ret
test2
movl %edi, %eax
leal (%rsi,%rsi,2), %edx
imull %edi, %eax
imull %edi, %eax
leal (%rax,%rsi), %ecx
movl %edi, %eax
imull %esi, %eax
imull %edi, %eax
addl %eax, %edx
leal (%rcx,%rdx), %eax
imull %ecx, %eax
addl %edx, %eax
ret
Trying your code on godbolt, I get identical assembly for both functions with GCC, even with just the -O setting. Only by omitting the -O flag do I get different results. And this really is expected, because the code is trivial to optimize.
Here is the generated assembly using gcc 4.4.7 with the -O flag. As you can see, they are identical.
test1(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
test2(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
The answer is twofold:
1. Your statement about identical results is a misconception.
2. I cannot reproduce your result "test1 faster than test2".
Preprocessor misconception
The results should not be identical. The preprocessor acts on (transforms) the source before it is actually compiled by the compiler with whatever options.
You can inspect the result of the preprocessor by running gcc -E main.c for example, assuming you are using a GNU compiler and your sources above are stored in a file main.c. The relevant parts become:
int test1(int a, int b)
{
    return (a*a*a+b)*(a*a*a+b)+(a*a*a+b)*(a*b*a+3*b)+(a*b*a+3*b);
}

int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}
Obviously, the first version performs roughly twice as many arithmetic operations as the second. Then the compiler and its optimiser come into play …
(NB: Ideally you could analyse the number of CPU cycles of the generated assembly code. Use e.g. gcc -S main.c and look at main.s; you probably know that. Version 2 should "win" in that case.)
Runtime testing and optimising
In order to compare our results, you should post your test code. When testing, you need to average out short-term fluctuations and the timing granularity of your CPU. Hence you are likely to run loops over the same code:
int i = 100000000;
while (--i > 0) {
    int r;
    r = test1(3, 4);
}
Without the optimiser, test1 indeed runs about 20% slower than test2.
However, the optimiser also analyses the calling code, and it can optimise away repeated calls with identical arguments, or calls whose result is unused (r in this case).
Therefore you must fool the compiler into actually making the calls, like so:
int r = 0;
while (--i > 0) {
    r += test1(3, i);
}
When I tried that, I got identical runtimes to within a percent; i.e. sometimes test1 is faster, sometimes test2 is faster, when I repeat the comparison several times.
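For reference, here is a minimal timing harness along those lines (my own sketch, assuming test1 from the question is linked in; the volatile sink and the varying argument keep the optimiser from deleting the loop):

#include <stdio.h>
#include <time.h>

extern int test1(int a, int b);

int main(void) {
    volatile int sink = 0;               /* volatile: the stores must survive */
    clock_t t0 = clock();
    for (int i = 100000000; i > 0; --i)
        sink += test1(3, i);             /* varying argument defeats hoisting */
    clock_t t1 = clock();
    printf("%.3f s (sink = %d)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, sink);
    return 0;
}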
You should look into the optimiser documentation to understand which optimisations you need to defeat in your tests.
And I confirm what @Ville Krumlinde states: I get identical code for the assembly output, even with -O level optimisation (gcc 4.4.7 on my desktop). The code contains only 9 assembler operations, which makes me believe that the optimiser "knows" enough about algebraic simplification to reduce your formulas.
So you may simply have been misled by a spurious optimiser effect in your test frame after all.

Difference of n/=10 and n=n/10

Is there any difference between n /= 10 and n = n / 10 in terms of execution speed?
Just like n-- and --n supposedly differ in their execution speed...
No, not really:
[C99: 6.5.16.2/3]: A compound assignment of the form E1 op= E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.
So, this has consequences only if your n is a non-trivial expression with side-effects (such as a function call).
Otherwise, I suppose in theory an intermediate temporary variable will be involved, but you'd have to be remarkably unlucky for such a temporary to actually survive in your compiled executable. You're not going to see any performance difference between the two approaches.
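Here is a small sketch of the "evaluated only once" distinction (the idx helper is my own, purely to give the lvalue a countable side effect):

#include <stdio.h>

static int calls;

/* hypothetical helper: a subscript with a side effect we can count */
static int idx(void) { ++calls; return 0; }

int main(void) {
    int arr[1] = { 40 };

    arr[idx()] /= 10;                 /* lvalue evaluated once: idx() runs once */
    printf("calls = %d\n", calls);    /* prints: calls = 1 */

    arr[idx()] = arr[idx()] / 10;     /* lvalue spelled twice: idx() runs twice */
    printf("calls = %d\n", calls);    /* prints: calls = 3 */
    return 0;
}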
Confirm this with benchmarks, and by comparing the resulting assembly.
Given the following C code:
int f1(int n) {
    n /= 10;
    return n;
}

int f2(int n) {
    n = n / 10;
    return n;
}
compiled with gcc -O4 essentially results in
f1:
movl %edi, %eax
movl $1717986919, %edx
sarl $31, %edi
imull %edx
sarl $2, %edx
subl %edi, %edx
movl %edx, %eax
ret
f2:
movl %edi, %eax
movl $1717986919, %edx
sarl $31, %edi
imull %edx
sarl $2, %edx
subl %edi, %edx
movl %edx, %eax
ret
I have omitted some boilerplate which is part of the listing in reality.
In this specific case, there is no difference between the two alternatives.
Depending on the compiler used, on the actual environment where the instructions are executed and on compiler optimization levels, the generated code might be different. But you can always use this approach to check if the resulting machine code differs or not.
There is no difference between the two.
I checked both expressions in the KEIL cross compiler, and the same execution time is required:
=================================================
     5: x=x/5;
     6:
C:0x0005    EF        MOV A,R7
C:0x0006    75F005    MOV B(0xF0),#0x05
C:0x0009    84        DIV AB
     7: x/=5;
C:0x000A    75F005    MOV B(0xF0),#0x05
C:0x000D    84        DIV AB
C:0x000E    FF        MOV R7,A
=================================================
So there is no difference, just as with --n and n--.

Explain the output of this C Program? [duplicate]

This question already has answers here:
Why are these constructs using pre and post-increment undefined behavior?
(14 answers)
Closed 9 years ago.
#include <stdio.h>
#define CUBE(x) (x*x*x)

int main()
{
    int a, b = 3;
    a = CUBE(++b);
    printf("%d, %d\n", a, b);
    return 0;
}
This code gives a = 150 and b = 6. Please explain this.
I thought the value of a would be calculated as a = 4*5*6 = 120, but according to the compiler that isn't true, so please explain the logic...
There's no logic; it's undefined behavior, because
++b * ++b * ++b;
modifies and reads b 3 times with no interleaving sequence points.
Bonus: You'll see another weird behavior if you try CUBE(1+2).
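To make that bonus concrete (my own expansion of the hint, not from the original answer):

#include <stdio.h>
#define CUBE(x) (x*x*x)

int main(void) {
    /* CUBE(1+2) expands to (1+2*1+2*1+2), i.e. 1+2+2+2 = 7, not 27,
       because the macro body does not parenthesize its argument */
    printf("%d\n", CUBE(1+2));           /* prints: 7  */
    printf("%d\n", (1+2)*(1+2)*(1+2));   /* prints: 27 */
    return 0;
}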
In addition to what Luchian Grigore said (which explains why you observe this weird behavior), you should notice that this macro is horrible: it can cause subtle and very hard-to-track-down bugs, especially when called with an expression that has side effects (like ++b), since it causes that expression to be evaluated multiple times.
You should learn three things from this:
Never reference a macro argument more than once in a macro. While there are exceptions to this rule, you should prefer to think of it as absolute.
Try to avoid calling macros with expressions that contain side effects, if possible.
Try to avoid function-like macros when possible. Use inline functions instead, as in the sketch below.
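For instance, a minimal sketch of the inline-function alternative (the name cube is my own):

#include <stdio.h>

/* inline function: the argument is evaluated exactly once */
static inline int cube(int x) { return x * x * x; }

int main(void) {
    int b = 3;
    int a = cube(++b);            /* b incremented once: a = 4*4*4 = 64, b = 4 */
    printf("%d, %d\n", a, b);     /* prints: 64, 4 */
    return 0;
}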
It's undefined behavior to modify the same variable more than once between sequence points, and for this reason you will get different results with different compilers for your code.
As it happens, I also get the same result, a = 150 and b = 6, with my compiler:
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
Your macro expression a = CUBE(++b); expands to
a = ++b * ++b * ++b;
And b is changed more than once before the end of the full expression.
But how does my compiler translate this expression at a low level? (Maybe your compiler does something similar, and you can try the same technique.) To find out, I compiled the C source with the -S option and got assembly code:
gcc x.c -S
You will get an x.s file.
Since you want to know how the output comes to be 150, I am showing the relevant part of the assembly (read the comments):
movq  %rsp, %rbp
.cfi_def_cfa_register 6
subq  $16, %rsp
movl  $3, -8(%rbp)    // b = 3
addl  $1, -8(%rbp)    // b = 4
addl  $1, -8(%rbp)    // b = 5
movl  -8(%rbp), %eax  // eax = b, i.e. 5
imull -8(%rbp), %eax  // eax = 5 * 5 = 25
addl  $1, -8(%rbp)    // b becomes 6
imull -8(%rbp), %eax  // eax = 25 * 6 = 150
movl  %eax, -4(%rbp)  // a = 150
movl  $.LC0, %eax     // printf setup...
movl  -8(%rbp), %edx
movl  -4(%rbp), %ecx
movl  %ecx, %esi
movq  %rax, %rdi
Inspecting this assembly, I can see that it evaluates the expression as
a = 5 * 5 * 6, so a becomes 150, and after three increments b becomes 6.
Different compilers may produce different results, but I think 150 can only be arrived at by this sequence for b = 3: the expression evaluates as 5*5*6.

stack layout for process

Why is the output 44 instead of 12? I guess &x, eip, ebp, &j should be the stack layout for the code given below; if so, the output must be 12, but I am getting 44.
So help me understand the concept: although the base pointer is relocated for every execution, the relative offset must remain unchanged, so shouldn't it be 12?
#include <stdio.h>

int fun(unsigned int *ptr);

int main() {
    int x = 9;
    // printf("%p is the address of x\n", &x);
    fun(&x);
    printf("%x is the address of x\n", (&x));
    x = 3;
    printf("%d x is \n", x);
    return 0;
}

int fun(unsigned int *ptr) {
    int j;
    printf("the difference is %u\n", ((unsigned int)ptr - (unsigned int)&j));
    printf("the address of j is %x\n", &j);
    return 0;
}
You're assuming that the compiler has packed everything (instead of putting things on specific alignment boundaries), and you're also assuming that the compiler hasn't inlined the function or made any other optimisations or transformations.
In summary, you cannot make any assumptions about this sort of thing, or rely on any particular behaviour.
No, it should not be 12. It should not be anything. The ISO standard has very little to say about how things are arranged on the stack. An implementation has a great deal of leeway in moving things around and inserting padding for efficiency.
If you pass it through your compiler with an option to generate the assembler code (such as with gcc -S), it will become evident what's happening here.
fun:
pushl %ebp
movl %esp, %ebp
subl $40, %esp ;; this is quite large (see below).
movl 8(%ebp), %edx
leal -12(%ebp), %eax
movl %edx, %ecx
subl %eax, %ecx
movl %ecx, %eax
movl %eax, 4(%esp)
movl $.LC2, (%esp)
call printf
leal -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC3, (%esp)
call printf
movl $0, %eax
leave
ret
It appears that gcc is pre-allocating the next stack frame (the one going down to printf) as part of the prolog of the fun function, at least in this case. Your compiler may be doing something totally different. But the bottom line is: the implementation can do what it likes as long as it doesn't violate the standard.
That's the code at optimisation level 0, by the way, and it gives a difference of 48. When I use gcc's insane optimisation level 3, I get a difference of 4. Again, perfectly acceptable; gcc usually does some pretty impressive optimisations at that level.
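If you want to observe this yourself without truncating pointers to unsigned int, here is a small sketch of my own (uintptr_t is C99); rebuild it at different -O levels and watch the gap change:

#include <stdio.h>
#include <stdint.h>

/* prints the distance between the caller's variable and a local;
   the value is implementation-defined and varies with the -O level */
static int fun(unsigned int *ptr) {
    int j;
    printf("gap: %lu bytes\n",
           (unsigned long)((uintptr_t)ptr - (uintptr_t)&j));
    return 0;
}

int main(void) {
    unsigned int x = 9;
    fun(&x);
    return 0;
}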
