I just compiled the following C code to test out the gcc optimizer (using the -O3 flag), expecting that both functions would end up generating the same set of assembly instructions:
int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
return x*x+x*y+y;
#undef x
#undef y
}
int test2(int a, int b)
{
int x = a*a*a+b;
int y = a*b*a+3*b;
return x*x+x*y+y;
}
But I was surprised to find that they generated slightly different assembly, and that the execution time for test1 (the code using the preprocessor instead of local variables) was a bit faster.
I've heard people say that the compiler can optimize better than humans can, and that you should tell it exactly what you want it to do; man I guess they weren't kidding. I thought the compiler was supposed to kind of guess at the programmer's intended use of local variables and replace their use if necessary... is that a false assumption?
When writing code for performance, are you better off using preprocessor definitions for the sake of readability rather than local variables? I know it looks ugly as hell, but apparently it actually makes a difference, unless I'm missing something.
Here's the assembly I got, using "gcc test.c -O3 -S". My gcc version is 4.8.2; it looks like the assembly output is the same for both functions on most versions of gcc, but not on the 4.7 or 4.8 series, for some reason.
test1
movl %edi, %eax
movl %edi, %edx
leal (%rsi,%rsi,2), %ecx
imull %edi, %eax
imull %esi, %edx
imull %edi, %eax
imull %edi, %edx
addl %esi, %eax
addl %ecx, %edx
leal (%rax,%rdx), %ecx
imull %ecx, %eax
addl %edx, %eax
ret
test2
movl %edi, %eax
leal (%rsi,%rsi,2), %edx
imull %edi, %eax
imull %edi, %eax
leal (%rax,%rsi), %ecx
movl %edi, %eax
imull %esi, %eax
imull %edi, %eax
addl %eax, %edx
leal (%rcx,%rdx), %eax
imull %ecx, %eax
addl %edx, %eax
ret
Trying your code at godbolt I get identical assembly for both functions with GCC, even with the -O setting. Only by omitting the -O flag do I get different results. And this really is expected, because the code is trivial to optimize.
Here is the generated assembly using gcc 4.4.7 with the -O flag. As you can see, they are identical.
test1(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
test2(int, int):
movl %edi, %eax
imull %edi, %eax
imull %eax, %edi
addl $3, %eax
imull %esi, %eax
addl %esi, %edi
leal (%rax,%rdi), %edx
imull %edi, %edx
leal (%rdx,%rax), %eax
ret
The answer is twofold:
Your expectation of identical results is a misconception
I cannot reproduce your results "test1 faster than test2".
Preprocessor misconception
The results should not be identical. The preprocessor acts on (transforms) the source before it is actually compiled by the compiler with whatever options.
You can inspect the result of the preprocessor by running gcc -E main.c for example, assuming you are using a GNU compiler and your sources above are stored in a file main.c. The relevant parts become:
int test1(int a, int b)
{
return (a*a*a+b)*(a*a*a+b)+(a*a*a+b)*(a*b*a+3*b)+(a*b*a+3*b);
}
int test2(int a, int b)
{
int x = a*a*a+b;
int y = a*b*a+3*b;
return x*x+x*y+y;
}
Obviously, the first version performs roughly twice as many arithmetic operations as the second one. Then the compiler and its optimiser come into play …
(NB: Ideally you could analyse the number of CPU cycles needed by the generated assembler code. Use e.g. gcc -S main.c and look at main.s; you probably know that. Version 2 should "win" in that case.)
Runtime testing and optimising
In order to compare our results, you should post your test code. When testing, you need to average out short-term fluctuations and the timing granularity of your CPU. Hence you are likely to run the same code in a loop.
int i=100000000;
while (--i>0) {
int r;
r = test1(3, 4);
}
Without the optimiser, test1 clearly runs about 20% slower than test2.
However, the optimiser will also analyse the calling code and can optimise away repeated calls with identical arguments, or calls whose results are unused (r in this case).
Therefore you must fool the compiler into actually making the calls, like this:
int r = 0;
while (--i>0) {
r += test1(3, i);
}
When I tried that, I got identical runtimes to within a percent, i.e. sometimes test1 is faster, sometimes test2, when I repeat the comparison several times.
You should look into the optimiser documentation to understand which optimisation options you need to work around in your tests.
And I confirm what @Ville Krumlinde states: I get identical code for the assembly output, even with -O level optimisation (gcc 4.4.7 on my desktop). The code only contains 9 operations in assembler, which makes me believe that the optimiser "knows" enough about algebraic simplification to reduce your formulas.
So you may just have been misled by an artefact of your test framework after all.
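For reference, here is a minimal, self-contained harness along the lines described above. The clock()-based timing, the iteration count, the volatile sink, and the masking of the second argument are my own choices for this sketch, not part of the original test:
#include <stdio.h>
#include <time.h>

int test1(int a, int b)
{
#define x (a*a*a+b)
#define y (a*b*a+3*b)
    return x*x+x*y+y;
#undef x
#undef y
}

int test2(int a, int b)
{
    int x = a*a*a+b;
    int y = a*b*a+3*b;
    return x*x+x*y+y;
}

int main(void)
{
    volatile unsigned sink = 0;     /* volatile: results must be used, calls can't be removed */
    const int iterations = 100000000;
    int i;
    clock_t start;
    double t1, t2;

    start = clock();
    for (i = iterations; i > 0; --i)
        sink += (unsigned)test1(3, i & 0xFF);   /* mask keeps the intermediate maths in range */
    t1 = (double)(clock() - start) / CLOCKS_PER_SEC;

    start = clock();
    for (i = iterations; i > 0; --i)
        sink += (unsigned)test2(3, i & 0xFF);
    t2 = (double)(clock() - start) / CLOCKS_PER_SEC;

    printf("test1: %.3f s, test2: %.3f s (sink=%u)\n", t1, t2, sink);
    return 0;
}
The volatile accumulator forces the results to be consumed, and the varying argument prevents the whole loop from being folded into a constant.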
Related
If I compile this C function:
int function(unsigned u) {
int r = __builtin_clz(u) + __builtin_ctz(u);
return r;
}
in clang using -march=native, I get the following assembly:
pushq %rbp
movq %rsp, %rbp
lzcntl %edi, %ecx
tzcntl %edi, %eax
addl %ecx, %eax
popq %rbp
retq
and the compiled object code runs fine on my processor. (I'm on a Mac OS, so I query the CPU features using sysctl -a machdep.cpu. It reports the BMI1 capability which includes TZCNT, and also the LZCNT capability. Also if I link that into a binary and execute it I get the expected results.)
If I compile the same code using gcc using -march=native, I get similar assembly:
xorl %eax, %eax
lzcntl %edi, %eax
tzcntl %edi, %edi
addl %edi, %eax
ret
but the assembler refuses to compile it. I get "no such instruction" errors for both the lzcntl and the tzcntl instructions. What do I need to do to get the assembler to cooperate? Is it feasible/reasonable to tell gcc to use a different assembler? If so, how do I do that?
I'm not trying to solve a specific problem, I'm just trying to explore in what circumstances I can use builtins for things like count-leading/trailing-zeros, popcount, and so on.
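Not an answer to the assembler error, but since the goal is exploring the builtins themselves, here is a small sketch of using them from plain C. The wrapper names clz32/ctz32 are mine, the code assumes a 32-bit unsigned int, and the zero checks are there because __builtin_clz(0) and __builtin_ctz(0) are undefined:
#include <stdio.h>

/* Hypothetical helpers: handle the zero case explicitly before calling the builtins. */
static int clz32(unsigned u) { return u ? __builtin_clz(u) : 32; }
static int ctz32(unsigned u) { return u ? __builtin_ctz(u) : 32; }

int main(void)
{
    unsigned v = 0x00F0u;   /* binary 1111 0000 */
    printf("clz=%d ctz=%d popcount=%d\n",
           clz32(v), ctz32(v), __builtin_popcount(v));
    /* expected for a 32-bit unsigned: clz=24 ctz=4 popcount=4 */
    return 0;
}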
Suppose I have the following C code:
#include <stdio.h>
int main()
{
int x = 11;
int y = x + 3;
printf("%d\n", x);
return 0;
}
Then I compile it into asm using gcc, and I get this (with some flags removed):
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $11, -4(%rbp)
movl -4(%rbp), %eax
addl $3, %eax
movl %eax, -8(%rbp)
movl -4(%rbp), %eax
movl %eax, %esi
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
leave
ret
My question is: why is it movl -4(%rbp), %eax followed by movl %eax, %esi, rather than a simple movl -4(%rbp), %esi (which works fine according to my experiment)?
You probably did not enable optimizations.
Without optimization the compiler will produce code like this. For one thing, it does not allocate variables to registers, but on the stack. This means that when you operate on variables, they will first be transferred to a register and then operated on.
So x is allocated at -4(%rbp), and the code reads as if each statement were translated directly, without optimization. First you move 11 into the storage of x. This means:
movl $11, -4(%rbp)
done with the first statement. The next statement is to evaluate x+3 and place in the storage of y (which is -8(%rbp), this is done without regard of the previous generated code:
movl -4(%rbp), %eax
addl $3, %eax
movl %eax, -8(%rbp)
done with the second statement. By the way that is divided into two parts: evaluation of x+3 and the storage of the result. Then the compiler continues to generate code for the printf statement, again without taking earlier statements into account.
If you, on the other hand, enable optimization, the compiler does a number of smart and, to humans, obvious things. One thing is that it allows variables to be allocated to registers, or at least keeps track of where the value of each variable can be found. In this case the compiler would, for example, know in the second statement that x is not only stored at -4(%ebp), it also knows that its value is the constant 11 (yes, it knows the actual value). It can then use this to add 3, which means it knows the result to be 14 (but it's smarter than that: it has also seen that you didn't use that variable, so it skips that statement entirely). The next statement is the printf call, and here it can use the fact that it knows x to be 11 and pass that directly to printf. By the way, it also realizes that it no longer needs the storage of x at -4(%ebp). Finally it may know what printf does (since you included stdio.h), so it can analyze the format string and do the conversion at compile time, replacing the printf call with one that writes 11 directly to standard out.
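To make that concrete, here is a hedged sketch of what the program effectively reduces to after constant propagation and dead-store elimination; this is an illustration of the reasoning above, not actual compiler output:
#include <stdio.h>

int main(void)
{
    /* x is known to be 11, y is never used, so only the output remains;
       the compiler may even turn printf("%d\n", 11) into puts("11"). */
    printf("%d\n", 11);
    return 0;
}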
Is there any difference between n /= 10 and n = n / 10 execution-speed wise?
Just like how n-- and --n differ in their execution speed...
No, not really:
[C99: 6.5.16.2/3]: A compound assignment of the form E1 op= E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.
So, this has consequences only if your n is a non-trivial expression with side-effects (such as a function call).
Otherwise, I suppose in theory an intermediate temporary variable will be involved, but you'd have to be remarkably unlucky for such a temporary to actually survive in your compiled executable. You're not going to see any performance difference between the two approaches.
Confirm this with benchmarks, and by comparing the resulting assembly.
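To illustrate the "evaluated only once" clause, here is a small sketch; the slot() helper and the call counter are purely illustrative additions, not part of the answer above:
#include <stdio.h>

static int calls = 0;

/* A lookup with a visible side effect, so evaluations can be counted. */
static int *slot(int *arr)
{
    ++calls;
    return &arr[0];
}

int main(void)
{
    int a[1] = { 100 };

    *slot(a) /= 10;                 /* lvalue evaluated once  */
    printf("calls = %d\n", calls);  /* prints 1               */

    *slot(a) = *slot(a) / 10;       /* lvalue evaluated twice */
    printf("calls = %d\n", calls);  /* prints 3               */

    return 0;
}
When n is a plain variable, as in the question, there is no such side effect and both forms compile to the same thing.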
Given the following C-code:
int f1(int n) {
n /= 10;
return n;
}
int f2(int n) {
n = n / 10;
return n;
}
compiled with gcc -O4 essentially results in
f1:
movl %edi, %eax
movl $1717986919, %edx
sarl $31, %edi
imull %edx
sarl $2, %edx
subl %edi, %edx
movl %edx, %eax
ret
f2:
movl %edi, %eax
movl $1717986919, %edx
sarl $31, %edi
imull %edx
sarl $2, %edx
subl %edi, %edx
movl %edx, %eax
ret
I have omitted some boilerplate which is part of the listing in reality.
In this specific case, there is no difference between the two alternatives.
Depending on the compiler used, on the actual environment where the instructions are executed and on compiler optimization levels, the generated code might be different. But you can always use this approach to check if the resulting machine code differs or not.
There is no difference between the two.
I checked both expressions with the Keil cross compiler, and the same execution time is required:
=================================================
5: x=x/5;
6:
C:0x0005 EF        MOV A,R7
C:0x0006 75F005    MOV B(0xF0),#0x05
C:0x0009 84        DIV AB
7: x/=5;
C:0x000A 75F005    MOV B(0xF0),#0x05
C:0x000D 84        DIV AB
C:0x000E FF        MOV R7,A
================================================
So there is no difference, just as with --n and n--.
Why is the output 44 instead of 12? I guess &x, eip, ebp, &j should be the stack layout for the code given below; if so, it should print 12 as the output, but I am getting 44.
So help me understand the concept: although the base pointer is relocated for every execution, the relative offset must remain unchanged, so shouldn't it be 12?
#include <stdio.h>

int fun(unsigned int *ptr);

int main() {
int x = 9;
// printf("%p is the address of x\n",&x);
fun(&x );
printf("%x is the address of x\n", (&x));
x = 3;
printf("%d x is \n",x);
return 0;
}
int fun(unsigned int *ptr) {
int j;
printf("the difference is %u\n",((unsigned int)ptr -(unsigned int) &j));
printf("the address of j is %x\n",&j);
return 0;
}
You're assuming that the compiler has packed everything (instead of putting things on specific alignment boundaries), and you're also assuming that the compiler hasn't inlined the function, or made any other optimisations or transformations.
In summary, you cannot make any assumptions about this sort of thing, or rely on any particular behaviour.
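If you still want to inspect the layout, here is a hedged, more portable rewrite of the code in the question (my own variant): it converts the pointers to uintptr_t before subtracting and prints addresses with %p, but the resulting difference is still entirely implementation-dependent:
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

static int fun(unsigned int *ptr) {
    int j;
    /* Convert to uintptr_t before subtracting; pointer arithmetic between
       unrelated objects is not defined, but integer subtraction is. */
    uintptr_t a = (uintptr_t)ptr, b = (uintptr_t)&j;
    printf("the difference is %" PRIuPTR "\n", a - b);
    printf("the address of j is %p\n", (void *)&j);
    return 0;
}

int main(void) {
    int x = 9;
    fun((unsigned int *)&x);
    printf("%p is the address of x\n", (void *)&x);
    return 0;
}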
No, it should not be 12. It should not be anything. The ISO standard has very little to say about how things are arranged on the stack. An implementation has a great deal of leeway in moving things around and inserting padding for efficiency.
If you pass it through your compiler with an option to generate the assembler code (such as with gcc -S), it will become evident what's happening here.
fun:
pushl %ebp
movl %esp, %ebp
subl $40, %esp ;; this is quite large (see below).
movl 8(%ebp), %edx
leal -12(%ebp), %eax
movl %edx, %ecx
subl %eax, %ecx
movl %ecx, %eax
movl %eax, 4(%esp)
movl $.LC2, (%esp)
call printf
leal -12(%ebp), %eax
movl %eax, 4(%esp)
movl $.LC3, (%esp)
call printf
movl $0, %eax
leave
ret
It appears that gcc is pre-generating the next stack frame (the one going down to printf) as part of the prolog for the fun function, at least in this case. Your compiler may be doing something totally different. But the bottom line is: the implementation can do what it likes as long as it doesn't violate the standard.
That's the code from optimisation level 0, by the way, and it gives a difference of 48. When I use gcc's insane optimisation level 3, I get a difference of 4. Again, perfectly acceptable, gcc usually does some pretty impressive optimisations at that level.
Wanting to see the output of the compiler (in assembly) for some C code, I wrote a simple program in C and generated its assembly file using gcc.
The code is this:
#include <stdio.h>
int main()
{
int i = 0;
if ( i == 0 )
{
printf("testing\n");
}
return 0;
}
The generated assembly for it is here (only the main function):
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
leave
ret
I am at an absolute loss to correlate the C code and assembly code. All that the code has to do is store 0 in a register and compare it with a constant 0 and take suitable action. But what is going on in the assembly?
Since main is special you can often get better results by doing this type of thing in another function (preferably in its own file with no main). For example:
void foo(int x) {
if (x == 0) {
printf("testing\n");
}
}
would probably be much more clear as assembly. Doing this would also allow you to compile with optimizations and still observe the conditional behavior. If you were to compile your original program with any optimization level above 0 it would probably do away with the comparison since the compiler could go ahead and calculate the result of that. With this code part of the comparison is hidden from the compiler (in the parameter x) so the compiler can't do this optimization.
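As a sketch of that idea, here is a possible driver (my own addition) that keeps x unknown at compile time by deriving it from argc, so the comparison in foo survives optimization:
#include <stdio.h>

void foo(int x) {
    if (x == 0) {
        printf("testing\n");
    }
}

int main(int argc, char **argv) {
    (void)argv;
    foo(argc - 1);   /* 0 when the program is run with no arguments */
    return 0;
}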
What the extra stuff actually is
_main:
pushl %ebp
movl %esp, %ebp
subl $24, %esp
andl $-16, %esp
This is setting up a stack frame for the current function. In x86 a stack frame is the area between the stack pointer's value (SP, ESP, or RSP for 16, 32, or 64 bit) and the base pointer's value (BP, EBP, or RBP). This is supposedly where local variables live, but not really, and explicit stack frames are optional in most cases. The use of alloca and/or variable length arrays would require their use, though.
This particular stack frame construction is different than for non-main functions because it also makes sure that the stack is 16 byte aligned. The subtraction from ESP increases the stack size by more than enough to hold the local variables, and the andl effectively subtracts between 0 and 15 from it, making it 16 byte aligned. This alignment seems excessive, except that it also forces the stack to start out cache aligned as well as word aligned.
movl $0, %eax
addl $15, %eax
addl $15, %eax
shrl $4, %eax
sall $4, %eax
movl %eax, -8(%ebp)
movl -8(%ebp), %eax
call __alloca
call ___main
I don't know what all this does. alloca increases the stack frame size by altering the value of the stack pointer.
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
movl $0, %eax
I think you know what this does. If not, the movl just before the call is moving the address of your string into the top location of the stack so that it may be retrieved by printf. It must be passed on the stack so that printf can use its address to infer the addresses of printf's other arguments (if any, which there aren't in this case).
leave
This instruction removes the stack frame talked about earlier. It is essentially movl %ebp, %esp followed by popl %ebp. There is also an enter instruction which can be used to construct stack frames, but gcc didn't use it. When stack frames aren't explicitly used, EBP may be used as a general purpose register, and instead of leave the compiler would just add the stack frame size to the stack pointer, which decreases the stack size by the frame size.
ret
I don't need to explain this.
When you compile with optimizations
I'm sure you will recompile all of this with different optimization levels, so I will point out something that may happen that you will probably find odd. I have observed gcc replacing printf and fprintf with puts and fputs, respectively, when the format string did not contain any % conversions and there were no additional parameters passed. This is because (for many reasons) it is much cheaper to call puts and fputs, and in the end you still get what you wanted printed.
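For instance, a small program like the following (my own illustration) is the kind of code where that substitution can show up; whether it actually happens depends on the gcc version and options:
#include <stdio.h>

int main(void)
{
    printf("testing\n");           /* no conversions, ends in \n: may become puts("testing") */
    printf("value = %d\n", 42);    /* has a %d conversion: stays a printf call               */
    return 0;
}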
Don't worry about the preamble/postamble - the part you're interested in is:
movl $0, -4(%ebp)
cmpl $0, -4(%ebp)
jne L2
movl $LC0, (%esp)
call _printf
L2:
It should be pretty self-evident as to how this correlates with the original C code.
The first part is some initialization code, which does not make any sense in the case of your simple example. This code would be removed with an optimization flag.
The last part can be mapped to C code:
movl $0, -4(%ebp) // put 0 into variable i (located at -4(%ebp))
cmpl $0, -4(%ebp) // compare variable i with value 0
jne L2 // if they are not equal, skip to after the printf call
movl $LC0, (%esp) // put the address of "testing\n" at the top of the stack
call _printf // do call printf
L2:
movl $0, %eax // return 0 (calling convention: %eax has the return code)
Well, much of it is the overhead associated with the function. main() is just a function like any other, so it has to store the return address on the stack at the start, set up the return value at the end, etc.
I would recommend using GCC to generate mixed source code and assembler, which will show you the assembler generated for each source line.
If you want to see the C code together with the assembly it was converted to, use a command line like this:
gcc -c -g -Wa,-a,-ad [other GCC options] foo.c > foo.lst
See http://www.delorie.com/djgpp/v2faq/faq8_20.html
On linux, just use gcc. On Windows down load Cygwin http://www.cygwin.com/
Edit - see also this question Using GCC to produce readable assembly?
and http://oprofile.sourceforge.net/doc/opannotate.html
You need some knowledge of assembly language to understand the assembly generated by a C compiler.
This tutorial might be helpful
See here for more information. You can generate the assembly code with C comments for better understanding.
gcc -g -Wa,-adhls your_c_file.c > you_asm_file.s
This should help you a little.