I currently have two functions A and B.
When compiled without any flags, A is faster than B.
But when compiled with -O1 or -O3, B is much faster than A.
I want to port the function to other languages, so it seems like A is a better choice.
But it would be great if I could understand how -O3 managed to speed up function B. Are there any good ways of at least getting a slight understanding of the kind of optimizations done by -O3?
-O3 does the same as -O2, and also:
Inline parts of functions.
Perform function cloning to make interprocedural constant propagation stronger.
Perform loop interchange outside of graphite. This can improve cache performance on loop nest and allow further loop optimizations, like vectorization, to take place. For example, the loop:
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
is transformed to
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            c[i][j] = c[i][j] + a[i][k]*b[k][j];
Apply unroll and jam transformations on feasible loops. In a loop nest this unrolls the outer loop by some factor and fuses the resulting multiple inner loops.
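For example (an illustration of unroll and jam, not taken from the GCC docs; sum, a, N and M are placeholder names), a nest like:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        sum[i] += a[i][j];
might, assuming N is even, become:
for (int i = 0; i < N; i += 2)
    for (int j = 0; j < M; j++) {
        sum[i]     += a[i][j];
        sum[i + 1] += a[i + 1][j];
    }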
Peel loops for which there is enough information that they do not roll much. This also turns on complete loop peeling (i.e., complete removal of loops with a small constant number of iterations).
Perform predictive commoning optimization, i.e., reusing computations (especially memory loads and stores) performed in previous iterations of loops.
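For example (an illustration, not from the GCC docs; a, b and n are placeholder names), in:
for (int i = 1; i < n; i++)
    b[i] = a[i - 1] + a[i];
the value loaded from a[i] in one iteration is the a[i - 1] of the next, so the compiler can carry it across iterations in a register instead of reloading it.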
Split paths leading to loop backedges. This can improve dead code elimination and common subexpression elimination.
Perform loop distribution. This can improve cache performance on big loop bodies and allow further loop optimizations, like parallelization or vectorization, to take place.
Move branches with loop invariant conditions out of the loop, with duplicates of the loop on both branches (modified according to result of the condition).
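For example (an illustration, not from the GCC docs; flag is a placeholder for a loop-invariant condition), the loop:
for (int i = 0; i < n; i++)
    if (flag)
        a[i] = b[i];
    else
        a[i] = 0;
becomes:
if (flag)
    for (int i = 0; i < n; i++)
        a[i] = b[i];
else
    for (int i = 0; i < n; i++)
        a[i] = 0;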
If a loop iterates over an array with a variable stride, create another version of the loop that assumes the stride is always one. For example:
for (int i = 0; i < n; ++i)
    x[i * stride] = …;
becomes:
if (stride == 1)
    for (int i = 0; i < n; ++i)
        x[i] = …;
else
    for (int i = 0; i < n; ++i)
        x[i * stride] = …;
For example, the following code:
unsigned long apply(unsigned long (*f)(unsigned long, unsigned long),
                    unsigned long a, unsigned long b, unsigned long c) {
    for (unsigned long i = 0; i < b; i++)
        c = f(c, a);
    return c;
}
unsigned long inc(unsigned long a, unsigned long b) { return a + 1; }
unsigned long add(unsigned long a, unsigned long b) { return apply(inc, 0, b, a); }
With -O3, GCC optimizes the add function to:
Intel syntax:
add:
lea rax, [rsi+rdi]
ret
AT&T syntax:
add:
leaq (%rsi,%rdi), %rax
ret
Without -O3, the output is:
Intel syntax:
add:
push rbp
mov rbp, rsp
sub rsp, 16
mov QWORD PTR [rbp-8], rdi
mov QWORD PTR [rbp-16], rsi
mov rdx, QWORD PTR [rbp-8]
mov rax, QWORD PTR [rbp-16]
mov rcx, rdx
mov rdx, rax
mov esi, 0
mov edi, OFFSET FLAT:inc
call apply
leave
ret
AT&T syntax:
add:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
movq -8(%rbp), %rdx
movq -16(%rbp), %rax
movq %rdx, %rcx
movq %rax, %rdx
movl $0, %esi
movl $inc, %edi
call apply
leave
ret
You can compare the output assembly for functions A and B using the -S flag and -masm=intel.
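For example (a.c and b.c are hypothetical file names, and diff is just one convenient way to inspect the result):
gcc -O3 -S -masm=intel a.c
gcc -O3 -S -masm=intel b.c
diff a.s b.s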
This answer is based on the GCC documentation; you can learn more from it.
The question being
Are there any good ways of at least getting a slight understanding of the kind of optimizations done by -O3?
and the intention apparently being that the question be answered in a general sense that does not take the actual code into consideration, the best answer I see is to recommend reading the documentation for your compiler, especially the documentation on optimizations.
Although not every optimization GCC performs has a corresponding option flag, most do. The docs specify which optimizations are performed at each level in terms of those flags, and they also specify what each individual flag means. Some of the terminology used in those explanations may be unfamiliar, but you should be able to glean at least "a slight understanding". Do start reading at the very top of the optimization docs.
Related Question
Say you have a simple function that returns a value based on a lookup table, for example:
See edit about assumptions.
uint32_t
lookup0(uint32_t r) {
    static const uint32_t tbl[] = { 0, 1, 2, 3 };
    if (r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }
    /* Can replace with: `return r`. */
    return tbl[r];
}

uint32_t
lookup1(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    if (r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }
    /* Can replace with: `return r / 2`. */
    return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is that I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common in NUMA locks that you will need to do cpu_id -> numa_node. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can get, I am hoping to find a generic solution that will be able to cover most layouts.
Looking at what modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
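(The case0/case1 sources are not shown in the post; presumably they are rewrites along these lines, mirroring the lookup functions' use of __builtin_unreachable:)
uint32_t case0(uint32_t r) {
    switch (r) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    }
    __builtin_unreachable();
}

uint32_t case1(uint32_t r) {
    switch (r) {
    case 0: return 0;
    case 1: return 0;
    case 2: return 1;
    case 3: return 1;
    }
    __builtin_unreachable();
}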
lookup0(unsigned int): # #lookup0(unsigned int)
movl %edi, %eax
movl lookup0(unsigned int)::tbl(,%rax,4), %eax
retq
...
case0(unsigned int): # #case0(unsigned int)
movl %edi, %eax
retq
but can't get lookup1.
lookup1(unsigned int): # #lookup1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
...
case1(unsigned int): # #case1(unsigned int)
movl %edi, %eax
movl .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
retq
GCC can't get either.
lookup0(unsigned int):
movl %edi, %edi
movl lookup0(unsigned int)::tbl(,%rdi,4), %eax
ret
lookup1(unsigned int):
movl %edi, %edi
movl lookup1(unsigned int)::tbl(,%rdi,4), %eax
ret
case0(unsigned int):
leal -1(%rdi), %eax
cmpl $2, %eax
movl $0, %eax
cmovbe %edi, %eax
ret
case1(unsigned int):
subl $2, %edi
xorl %eax, %eax
cmpl $1, %edi
setbe %al
ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
Edit:
The only true assumptions are:
All inputs have an index in the LUT.
All values are positive (I think that makes things easier) and this will be true for just about any sys-config that's online.
(Edit4) I would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.
In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut>, but that is specific to the one example I gave and would have some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The example lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.
The best "generic" solution I am aware of is the following:
int compute(int r)
{
    static const int T[] = {0, 0, 1, 1};
    const int lut_size = sizeof(T) / sizeof(T[0]);
    int result = 0;
    for (int i = 0; i < lut_size; ++i)
        result += (r == i) * T[i];
    return result;
}
With -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code similar to the following:
int compute(int r)
{
    return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}
The GCC/Clang optimizers know that the multiplication can be replaced with conditional moves (since developers often use this as a trick to guide compilers into generating assembly code without conditional branches).
The resulting assembly is the following for Clang:
compute:
xor ecx, ecx
cmp edi, 2
sete cl
xor eax, eax
cmp edi, 3
sete al
add eax, ecx
ret
The same applies to GCC. There are no branches or memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with the fast lea instruction.
A more complete test is available on Godbolt.
Note that this method should work for bigger tables, but if the table is too big the loop will not be automatically unrolled. You can tell the compiler to use more aggressive unrolling via compilation flags. That being said, if the table is big, a LUT will likely be faster, since having huge code to load and execute is slow in this pathological case.
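For instance, with GCC something like this should encourage more unrolling (a sketch; check your compiler's documentation for the exact flags, and compute.c is a placeholder file name):
gcc -O3 -funroll-loops compute.c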
You could pack the array into a long integer and use bit shifts and masking to extract the result.
For example, the table {2,0,3,1} could be handled with:
uint32_t lookup0(uint32_t r) {
    static const uint32_t tbl = (2u << 0) | (0u << 8) |
                                (3u << 16) | (1u << 24);
    return (tbl >> (8 * r)) & 0xff;
}
It produces relatively nice assembly:
lookup0: # #lookup0
lea ecx, [8*rdi]
mov eax, 16973826
shr eax, cl
movzx eax, al
ret
Not perfect but branchless and with no indirection.
This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
There are a few tricks to allow handling larger arrays like using longer integers (i.e. uint64_t or __uint128_t extension).
Another approach is to split the bits of the values in the array, e.g. into a high byte and a low byte, look each part up separately, and combine the results using bitwise operations.
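As a sketch of the longer-integer trick (the table contents here are arbitrary, and lookup8 is a hypothetical name), an 8-entry table of byte-sized values fits in a single uint64_t:
#include <stdint.h>

uint32_t lookup8(uint32_t r) {
    /* 8 entries, 8 bits each, packed into one 64-bit constant. */
    static const uint64_t tbl =
        (uint64_t)2 <<  0 | (uint64_t)0 <<  8 | (uint64_t)3 << 16 | (uint64_t)1 << 24 |
        (uint64_t)7 << 32 | (uint64_t)6 << 40 | (uint64_t)5 << 48 | (uint64_t)4 << 56;
    return (uint32_t)(tbl >> (8 * r)) & 0xffu;
}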
I have the following:
foo:
movl $0, %eax //result = 0
cmpq %rsi, %rdi // rdi = x, rsi = y?
jle .L2
.L3:
addq %rdi, %rax //result = result + i?
subq $1, %rdi //decrement?
cmp %rdi, rsi
jl .L3
.L2
rep
ret
And I'm trying to translate it to:
long foo(long x, long y)
{
    long i, result = 0;
    for (i= ; ; ) {
        //??
    }
    return result;
}
I don't know what cmpq %rsi, %rdi means.
Why isn't there another %eax for long i?
I would love some help figuring this out. I don't know what I'm missing; I've been going through my notes, textbook, and the rest of the internet, and I am stuck. It's a review question, and I've been at it for hours.
Assuming this is a function taking 2 parameters, and that it uses the gcc amd64 calling convention, the two parameters will be passed in rdi and rsi. In your C function you call these x and y.
long foo(long x /*rdi*/, long y /*rsi*/)
{
    //movl $0, %eax
    long result = 0; /* rax */
    //cmpq %rsi, %rdi
    //jle .L2
    if (x > y) {
        do {
            //addq %rdi, %rax
            result += x;
            //subq $1, %rdi
            --x;
            //cmp %rdi, rsi
            //jl .L3
        } while (x > y);
    }
    return result;
}
I don't know what cmpq %rsi, %rdi mean
That's AT&T syntax for cmp rdi, rsi. https://www.felixcloutier.com/x86/CMP.html
You can look up the details of what a single instruction does in an ISA manual.
More importantly, cmp/jcc like cmp %rsi,%rdi/jl is like jump if rdi<rsi.
Assembly - JG/JNLE/JL/JNGE after CMP. If you go through all the details of how cmp sets flags, and which flags each jcc condition checks, you can verify that it's correct, but it's much easier to just use the semantic meaning of JL = Jump on Less-than (assuming flags were set by a cmp) to remember what they do.
(It's reversed because of AT&T syntax; jcc predicates have the right semantic meaning for Intel syntax. This is one of the major reasons I usually prefer Intel syntax, but you can get used to AT&T syntax.)
From the use of rdi and rsi as inputs (reading them without / before writing them), they're the arg-passing registers. So this is the x86-64 System V calling convention, where integer args are passed in RDI, RSI, RDX, RCX, R8, R9, then on the stack. (What are the calling conventions for UNIX & Linux system calls on i386 and x86-64 covers function calls as well as system calls). The other major x86-64 calling convention is Windows x64, which passes the first 2 args in RCX and RDX (if they're both integer types).
So yes, x=RDI and y=RSI. And yes, result=RAX. (writing to EAX zero-extends into RAX).
From the code structure (not storing/reloading every C variable to memory between statements), it's compiled with some level of optimization enabled, so the for() loop turned into a normal asm loop with the conditional branch at the bottom. Why are loops always compiled into "do...while" style (tail jump)? (#BrianWalker's answer shows the asm loop transliterated back to C, with no attempt to form it back into an idiomatic for loop.)
From the cmp/jcc ahead of the loop, we can tell that the compiler can't prove the loop runs a non-zero number of iterations. So whatever the for() loop condition is, it might be false the first time. (That's unsurprising given signed integers.)
Since we don't see a separate register being used for i, we can conclude that optimization reused another var's register for i. Like probably for(i=x;, and then with the original value of x being unused for the rest of the function, it's "dead" and the compiler can just use RDI as i, destroying the original value of x.
I guessed i=x instead of y because RDI is the arg register that's modified inside the loop. We expect that the C source modifies i and result inside the loop, and presumably doesn't modify its input variables x and y. It would make no sense to do i=y and then do stuff like x--, although that would be another valid way of decompiling.
cmp %rdi, %rsi / jl .L3 means the loop condition to (re)enter the loop is rsi-rdi < 0 (signed), or i<y.
The cmp/jcc before the loop is checking the opposite condition; notice that the operands are reversed and it's checking jle, i.e. jng. So that makes sense: it really is the same loop condition, peeled out of the loop and implemented differently. Thus it's compatible with the C source being a plain for() loop with one condition.
sub $1, %rdi is obviously i-- or --i. We can do that inside the for(), or at the bottom of the loop body. The simplest and most idiomatic place to put it is in the 3rd section of the for(;;) statement.
addq %rdi, %rax is obviously adding i to result. We already know what RDI and RAX are in this function.
Putting the pieces together, we arrive at:
long foo(long x, long y)
{
    long i, result = 0;
    for (i = x; i > y; i--) {
        result += i;
    }
    return result;
}
Which compiler made this code?
From the .L3: label names, this looks like output from gcc. (Which somehow got corrupted, removing the : from .L2, and more importantly removing the % from %rsi in one cmp. Make sure you copy/paste code into SO questions to avoid this.)
So it may be possible with the right gcc version/options to get exactly this asm back out for some C input. It's probably gcc -O1, because movl $0, %eax rules out -O2 and higher (where GCC would look for the xor %eax,%eax peephole optimization for zeroing a register efficiently). But it's not -O0 because that would be storing/reloading the loop counter to memory. And -Og (optimize a bit, for debugging) likes to use a jmp to the loop condition instead of a separate cmp/jcc to skip the loop. This level of detail is basically irrelevant for simply decompiling to C that does the same thing.
The rep ret is another sign of gcc; gcc7 and earlier used this in their default tune=generic output for ret that's reached as a branch target or a fall-through from a jcc, because of AMD K8/K10 branch prediction. What does `rep ret` mean?
gcc8 and later will still use it with -mtune=k8 or -mtune=barcelona. But we can rule that out because that tuning option would use dec %rdi instead of subq $1, %rdi. (Only a few modern CPUs have any problems with inc/dec leaving CF unmodified, for register operands. INC instruction vs ADD 1: Does it matter?)
gcc4.8 and later put rep ret on the same line. gcc4.7 and earlier print it as you've shown, with the rep prefix on the line before.
gcc4.7 and later like to put the initial branch before the mov $0, %eax, which looks like a missed optimization. It means they need a separate return 0 path out of the function, which contains another mov $0, %eax.
gcc4.6.4 -O1 reproduces your output exactly, for the source shown above, on the Godbolt compiler explorer
# compiled with gcc4.6.4 -O1 -fverbose-asm
foo:
movl $0, %eax #, result
cmpq %rsi, %rdi # y, x
jle .L2 #,
.L3:
addq %rdi, %rax # i, result
subq $1, %rdi #, i
cmpq %rdi, %rsi # i, y
jl .L3 #,
.L2:
rep
ret
So does this other version which uses i=y. Of course there are many things we could add that would optimize away, like maybe i=y+1 and then having a loop condition like x>--i. (Signed overflow is undefined behaviour in C, so the compiler can assume it doesn't happen.)
// Also the same asm output, using i=y but modifying x in the loop.
long foo2(long x, long y) {
    long i, result = 0;
    for (i = y; x > i; x--) {
        result += x;
    }
    return result;
}
In practice the way I actually reversed this:
I copy/pasted the C template into Godbolt (https://godbolt.org/). I could see right away (from the mov $0 instead of xor-zero, and from the label names) that it looked like gcc -O1 output, so I put in that command line option and picked an old-ish version of gcc like gcc6. (Turns out this asm was actually from a much older gcc).
I tried an initial guess like x<y based on the cmp/jcc, and i++ (before I'd actually read the rest of the asm carefully at all), because for loops often use i++. The trivial-looking infinite-loop asm output showed me that was obviously wrong :P
I guessed that i=x, but after taking a wrong turn with a version that did result += x but i--, I realized that i was a distraction and at first simplified by not using i at all. I just used x-- while first reversing it because obviously RDI=x. (I know the x86-64 System V calling convention well enough to see that instantly.)
After looking at the loop body, the result += x and x-- were totally obvious from the add and sub instructions.
cmp/jl was obviously a something < something loop condition involving the 2 input vars.
I wasn't sure if it was x<y or y<x, and newer gcc versions were using jne as the loop condition. I think at that point I cheated and looked at Brian's answer to check it really was x > y, instead of taking a minute to work through the actual logic. But once I had figured out it was x--, only x>y made sense. The other one would be true until wraparound if it entered the loop at all, but signed overflow is undefined behaviour in C.
Then I looked at some older gcc versions to see if any made asm more like in the question.
Then I went back and replaced x with i inside the loop.
If this seems kind of haphazard and slapdash, that's because this loop is so tiny that I didn't expect to have any trouble figuring it out, and I was more interested in finding source + gcc version that exactly reproduced it, rather than the original problem of just reversing it at all.
(I'm not saying beginners should find it that easy, I'm just documenting my thought process in case anyone's curious.)
A developer can use the __builtin_expect builtin to help the compiler understand in which direction a branch is likely to go.
In the future, we may get a standard attribute for this purpose, but as of today at least all of clang, icc and gcc support the non-standard __builtin_expect instead.
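As a quick illustration of the idiom (ptr and handle_rare_error are hypothetical placeholders), the second argument to __builtin_expect is the value the expression is expected to have:
if (__builtin_expect(ptr == NULL, 0)) {
    /* the compiler lays out code assuming this branch is rarely taken */
    handle_rare_error();
}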
However, icc seems to generate oddly terrible code when you use it [1]. That is, code that uses the builtin is strictly worse than the code without it, regardless of which direction the prediction is made.
Take for example the following toy function:
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (b-- > 0);
    return a * 77;
}
Out of the three compilers, icc is the only one that compiles this to the optimal scalar loop of 3 instructions:
foo(int, int):
..B1.2: # Preds ..B1.2 ..B1.1
imul edi, edi, 77 #4.6
dec esi #5.12
jns ..B1.2 # Prob 82% #5.18
imul eax, edi, 77 #6.14
ret
Both gcc and Clang manage to miss the easy solution and use 5 instructions.
On the other hand, when you use likely or unlikely macros on the loop condition, icc goes totally braindead:
#define likely(x) __builtin_expect((x), 1)
#define unlikely(x) __builtin_expect((x), 0)
int foo(int a, int b)
{
    do {
        a *= 77;
    } while (likely(b-- > 0));
    return a * 77;
}
This loop is functionally equivalent to the previous loop (since __builtin_expect just returns its first argument), yet icc produces some awful code:
foo(int, int):
mov eax, 1 #9.12
..B1.2: # Preds ..B1.2 ..B1.1
xor edx, edx #9.12
test esi, esi #9.12
cmovg edx, eax #9.12
dec esi #9.12
imul edi, edi, 77 #8.6
test edx, edx #9.12
jne ..B1.2 # Prob 95% #9.12
imul eax, edi, 77 #11.15
ret #11.15
The function has doubled in size to 10 instructions, and (worse yet!) the critical loop has more than doubled to 7 instructions with a long critical dependency chain involving a cmov and other weird stuff.
The same is true if you use the unlikely hint and also across all icc versions (13, 14, 17) that godbolt supports. So the code generation is strictly worse, regardless of the hint, and regardless of the actual runtime behavior.
Neither gcc nor clang suffers any degradation when hints are used.
What's up with that?
[1] At least in the first and subsequent examples I tried.
To me it seems to be an ICC bug. This code (available on godbolt),
int c;
do
{
    a *= 77;
    c = b--;
}
while (likely(c > 0));
which simply uses an auxiliary local variable c, produces output without the edx = !!(esi > 0) pattern:
foo(int, int):
..B1.2:
mov eax, esi
dec esi
imul edi, edi, 77
test eax, eax
jg ..B1.2
It is still not optimal (it could do without eax), though.
I don't know if the official ICC policy about __builtin_expect is full support or just compatibility support.
This question seems better suited for the Official ICC forum.
I've tried posting this topic there, but I'm not sure I've done a good job (I've been spoiled by SO).
If they answer me I'll update this answer.
EDIT
I got an answer at the Intel Forum; they recorded this issue in their tracking system.
As of today, it seems to be a bug.
Don't let the instructions deceive you. What matters is performance.
Consider this rather crude test:
#include "stdafx.h"
#include <windows.h>
#include <iostream>
int foo(int a, int b) {
do { a *= 7; } while (b-- > 0);
return a * 7;
}
int fooA(int a, int b) {
__asm {
mov esi, b
mov edi, a
mov eax, a
B1:
imul edi, edi, 7
dec esi
jns B1
imul eax, edi, 7
}
}
int fooB(int a, int b) {
__asm {
mov esi, b
mov edi, a
mov eax, 1
B1:
xor edx, edx
test esi, esi
cmovg edx, eax
dec esi
imul edi, edi, 7
test edx, edx
jne B1
imul eax, edi, 7
}
}
int main() {
DWORD start = GetTickCount();
int j = 0;
for (int aa = -10; aa < 10; aa++) {
for (int bb = -500; bb < 15000; bb++) {
j += foo(aa, bb);
}
}
std::cout << "foo compiled (/Od)\n" << "j = " << j << "\n"
<< GetTickCount() - start << "ms\n\n";
start = GetTickCount();
j = 0;
for (int aa = -10; aa < 10; aa++) {
for (int bb = -500; bb < 15000; bb++) {
j += fooA(aa, bb);
}
}
std::cout << "optimal scalar\n" << "j = " << j << "\n"
<< GetTickCount() - start << "ms\n\n";
start = GetTickCount();
j = 0;
for (int aa = -10; aa < 10; aa++) {
for (int bb = -500; bb < 15000; bb++) {
j += fooB(aa, bb);
}
}
std::cout << "use likely \n" << "j = " << j << "\n"
<< GetTickCount() - start << "ms\n\n";
std::cin.get();
return 0;
}
produces output:
foo compiled (/Od)
j = -961623752
4422ms
optimal scalar
j = -961623752
1656ms
use likely
j = -961623752
1641ms
This is naturally entirely CPU dependent (tested here on Haswell i7), but both asm loops generally are very nearly identical in performance when tested over a range of inputs. A lot of this has to do with the selection and ordering of instructions being conducive to leveraging instruction pipelining (latency), branch prediction, and other hardware optimizations in the CPU.
The real lesson when you're optimizing is that you need to profile - it's extremely difficult to do this by inspection of the raw assembly.
Even with a challenging test where likely(b-- > 0) isn't true over a third of the time:
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -3; bb < 9; bb++) {
        j += fooX(aa, bb);
    }
}
results in:
foo compiled (/Od) : 1844ms
optimal scalar : 906ms
use likely : 1187ms
Which isn't bad. What you have to keep in mind is that the compiler will generally do its best without your interference. Using __builtin_expect and the like should really be restricted to cases where you have existing code that you have profiled and that you have specifically identified as being both hotspots and as having pipeline or prediction issues. This trivial example is an ideal case where the compiler will almost certainly do the right thing without help from you.
By including __builtin_expect you're asking the compiler to necessarily compile in a different way - a more complex way, in terms of pure number of instructions, but a more intelligent way in that it structures the assembly in a way that helps the CPU make better branch predictions. In this case of pure register play (as in this example) there's not much at stake, but if it improves prediction in a more complex loop, maybe saving you a bad misprediction, cache misses, and related collateral damage, then it's probably worth using.
I think it's pretty clear here, at least, that when the branch actually is likely then we very nearly recover the full performance of the optimal loop (which I think is impressive). In cases where the "optimal loop" is rather more complex and less trivial we can expect that the codegen would indeed improve branch prediction rates (which is what this is really all about). I think this is really a case of if you don't need it, don't use it.
On the topic of likely vs unlikely generating the same assembly, this doesn't imply that the compiler is broken - it just means that the same codegen is effective regardless of whether the branch is mostly taken or mostly not taken - as long as it is mostly something, it's good (in this case). The codegen is designed to optimise use of the instruction pipeline and to assist branch prediction, which it does. While we saw some reduction in performance with the mixed case above, pushing the loop to mostly unlikely recovers performance.
for (int aa = -10000000; aa < 10000000; aa++) {
    for (int bb = -30; bb < 1; bb++) {
        j += fooX(aa, bb);
    }
}
foo compiled (/Od) : 2453ms
optimal scalar : 1968ms
use likely : 2094ms
So I have a question regarding the performance of two different coding techniques. Can you help me understand which one is faster/better and why?
Here is the first technique:
int x, y, i;
for(i=0; i<10; i++)
{
    //do stuff with x and y
}
//reset x and y to zero
x=0;
y=0;
And here is the second one:
int i;
for(i=0; i<10; i++)
{
    int x, y;
    //do the same stuff with x and y as above
}
So which coding technique is better?
Also, if you know a better technique, and/or any site/article etc. where I can read about this and other performance-related stuff, I would love to have that as well!
It does not matter at all, because compilers don't automatically translate variable declarations into memory or register allocations. The difference between the two samples is that in the first case the variables are visible outside of the loop body, and in the second case they are not. However, this difference exists at the C level only; if you don't use the variables outside the loop, it will result in the same compiled code.
The compiler has two options for where to store a local variable: either on the stack or in a register. For each variable you use in your program, the compiler has to choose where it is going to live. If on the stack, then it needs to decrement the stack pointer to make room for the variable. But this decrement will not happen at the place of the variable declaration; typically it will be done at the beginning of the function: the stack pointer is decremented only once, by an amount sufficient to hold all of the stack-allocated variables. If the variable is only going to live in a register, no initialization needs to be done, and the register will be used as the destination when you first do an assignment. The important thing is that the compiler can and will re-use memory locations and registers that were previously used for variables which are now out of scope.
For illustration, I made two test programs. I used 10000 iterations instead of 10 because otherwise the compiler would unroll the loop at high optimization levels. The programs use rand to make for a quick and portable demo, but it should not be used in production code.
declare_once.c:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

int main(void) {
    srand(time(NULL));
    int x, y, i;
    for (i = 0; i < 10000; i++) {
        x = rand();
        y = rand();
        printf("Got %d and %d !\n", x, y);
    }
    return 0;
}
redeclare.c is the same, except for the loop, which is:
for (i = 0; i < 10000; i++) {
    int x, y;
    x = rand();
    y = rand();
    printf("Got %d and %d !\n", x, y);
}
I compiled the programs using Apple's LLVM version 7.3.0 on x86_64 Mac. I asked it for assembly output which I reproduced below, leaving out the parts unrelated to the question.
clang -O0 -S declare_once.c -o declare_once.S:
_main:
## Function prologue
pushq %rbp
movq %rsp, %rbp ## Move the old value of the stack
## pointer (%rsp) to the base pointer
## (%rbp), which will be used to
## address stack variables
subq $32, %rsp ## Decrement the stack pointer by 32
## to make room for up to 32 bytes
## worth of stack variables including
## x and y
## Removed code that calls srand
movl $0, -16(%rbp) ## i = 0. i has been assigned to the 4
## bytes starting at address -16(%rbp),
## which means 16 less than the base
## pointer (so here, 16 more than the
## stack pointer).
LBB0_1:
cmpl $10, -16(%rbp)
jge LBB0_4
callq _rand ## Call rand. The return value will be in %eax
movl %eax, -8(%rbp) ## Assign the return value of rand to x.
## x has been assigned to the 4 bytes
## starting at -8(%rbp)
callq _rand
leaq L_.str(%rip), %rdi
movl %eax, -12(%rbp) ## Assign the return value of rand to y.
## y has been assigned to the 4 bytes
## starting at -12(%rbp)
movl -8(%rbp), %esi
movl -12(%rbp), %edx
movb $0, %al
callq _printf
movl %eax, -20(%rbp)
movl -16(%rbp), %eax
addl $1, %eax
movl %eax, -16(%rbp)
jmp LBB0_1
LBB0_4:
xorl %eax, %eax
addq $32, %rsp ## Add 32 to the stack pointer :
## deallocate all stack variables
## including x and y
popq %rbp
retq
The assembly output for redeclare.c is almost exactly the same, except that for some reason x and y get assigned to -16(%rbp) and -12(%rbp) respectively, and i gets assigned to -8(%rbp). I copy-pasted only the loop:
movl $0, -16(%rbp)
LBB0_1:
cmpl $10, -16(%rbp)
jge LBB0_4
callq _rand
movl %eax, -8(%rbp) ## x = rand();
callq _rand
leaq L_.str(%rip), %rdi
movl %eax, -12(%rbp) ## y = rand();
movl -8(%rbp), %esi
movl -12(%rbp), %edx
movb $0, %al
callq _printf
movl %eax, -20(%rbp)
movl -16(%rbp), %eax
addl $1, %eax
movl %eax, -16(%rbp)
jmp LBB0_1
So we see that even at -O0 the generated code is the same. The important thing to note is that the same memory locations are reused for x and y in each loop iteration, even though they are separate variables at each iteration from the C language point of view.
At -O3 the variables are kept in registers, and both programs output the exact same assembly.
clang -O3 -S declare_once.c -o declare_once.S:
movl $10000, %ebx ## i will be in %ebx. The compiler decided
## to count down from 10000 because
## comparisons to 0 are less expensive,
## so it actually does i = 10000.
leaq L_.str(%rip), %r14
.align 4, 0x90
LBB0_1:
callq _rand
movl %eax, %r15d ## x = rand(). x has been assigned to
## register %r15d (32 less significant
## bits of r15)
callq _rand
movl %eax, %ecx ## y = rand(). y has been assigned to
## register %ecx
xorl %eax, %eax
movq %r14, %rdi
movl %r15d, %esi
movl %ecx, %edx
callq _printf
decl %ebx
jne LBB0_1
So again, no differences between the two versions, and even though in redeclare.c we have different variables at each iteration, the same registers are re-used so that there is no allocation overhead.
Keep in mind that everything I said applies to variables that are assigned in each loop iteration, which seems to be what you were thinking. If on the other hand you want to use the same values for all iterations, of course the assignment should be done before the loop.
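For instance (a minimal sketch; compute_once and use are hypothetical placeholders):
int x = compute_once();      /* evaluated a single time */
for (int i = 0; i < 10; i++) {
    use(x);                  /* sees the same value on every iteration */
}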
Declaring the variables in the inner-most scope where you'll use them:
int i;
for(i=0; i<10; i++)
{
    int x, y;
    //do the same stuff with x and y as above
}
is always going to be preferred. The biggest improvement is that you've limited the scope of the x and y variables. This prevents you from accidentally using them where you didn't intend to.
Even if you use "the same" variables again:
int i;
for(i=0; i<10; i++)
{
    int x, y;
    //do the same stuff with x and y as above
}
for(i=0; i<10; i++)
{
    int x, y;
    //do the same stuff with x and y as above
}
there will be no performance impact whatsoever. The statement int x, y has practically no effect at runtime.
Most modern compilers will calculate the total size of all local variables, and emit code to reserve the space on the stack (e.g. sub esp, 90h) once in the function prologue. The space for these variables will almost certainly be re-used from one "version" of x to the next. It's purely a lexical construct that the compiler uses to keep you from using that "space" on the stack where you didn't intend to.
It should not matter because you need to initialize the variables in either case. Additionally, the first case sets x and y after they are no longer being used. As a result, the reset is not needed.
Here is the first technique:
int x=0, y=0, i;
for(i=0; i<10; i++)
{
    //do stuff with x and y
    // x and y stay at the value they get set to during the pass
}
// x and y need to be reset if you want to use them again,
// or they would retain whatever they became during the last pass.
If you had wanted x and y to be reset to 0 inside the loop, then you would need to say
int x, y, i;
for(i=0; i<10; i++)
{
    //reset x and y to zero
    x=0;
    y=0;
    //do stuff with x and y
    // Now x and y get reset before the next pass
}
The second procedure makes x and y local in scope, so they are dropped at the end of the last pass. The values retain whatever they were set to during each pass for the next pass. The compiler will actually set up the variables and initialize them at compile time, not at run time. Thus you will not be defining (and initializing) the variables on each pass through the loop.
And here is the second one:
int i;
for(i=0; i<10; i++)
{
    int x=0, y=0;
    //do the same stuff with x and y as above
    // Usually x and y only set to 0 at start of first pass.
}
Best Practices
So which coding technique is better?
As others have pointed out, given a sufficiently mature/modern compiler the performance aspect will likely be null due to optimization. Instead, the preferred code is determined by virtue of sets of ideas known as best practices.
Limiting Scope
"Scope" describes the range of access in your code. Assuming the intended scope is to be limited to within the loop itself, x and y should be declared inside the loop as the compiler will prevent you from using them later on in your function. However, in your OP you show them being reset, which implies they will be used again later for other purposes. In this case, you must declare them towards the top (e.g. outside the loop) so you can use them later.
Here's some code you can use to demonstrate the limiting of the scope:
#include <stdio.h>

#define IS_SCOPE_LIMITED

int main ( void )
{
    int i;
#ifndef IS_SCOPE_LIMITED
    int x, y; // compiler will not complain, scope is generous
#endif
    for(i=0; i<10; i++)
    {
#ifdef IS_SCOPE_LIMITED
        int x, y; // compiler will complain about use outside of loop
#endif
        x = i;
        y = x+1;
        y++;
    }
    printf("X is %d and Y is %d\n", x, y);
}
To test the scope, comment out the #define towards the top. Compile with gcc -Wall loopVars.c -o loopVars and run with ./loopVars.
Benchmarking and Profiling
If you're still concerned about performance, possibly because you have some obscure operations involving these variables, then test, test, and test again! (try benchmarking or profiling your code). Again, with optimizations you probably won't find significant (if any) differences because the compiler will have done all this (allocation of variable space) prior to runtime.
UPDATE
To demonstrate this another way, you could remove the #ifdef and the #ifndef from the code (also removing each #endif), and add a line immediately preceding the printf such as x=2; y=3;. What you will find is the code will compile and run but the output will be "X is 2 and Y is 3". This is legal because the two scopes prevent the identically-named variables from competing with each other. Of course, this is a bad idea because you now have multiple variables within the same piece of code with identical names and with more complex code this will not be as easy to read and maintain.
In the specific case of int variables, it makes little (or no) difference.
For variables of more complex types, especially something with a constructor that (for example) allocates some memory dynamically, re-creating the variable every iteration of a loop may be substantially slower than re-initializing it instead. For example:
#include <vector>
#include <chrono>
#include <numeric>
#include <iostream>

unsigned long long versionA() {
    std::vector<int> x;
    unsigned long long total = 0;
    for (int j = 0; j < 1000; j++) {
        x.clear();
        for (int i = 0; i < 1000; i++)
            x.push_back(i);
        total += std::accumulate(x.begin(), x.end(), 0ULL);
    }
    return total;
}

unsigned long long versionB() {
    unsigned long long total = 0;
    for (int j = 0; j < 1000; j++) {
        std::vector<int> x;
        for (int i = 0; i < 1000; i++)
            x.push_back(i);
        total += std::accumulate(x.begin(), x.end(), 0ULL);
    }
    return total;
}

template <class F>
void timer(F f) {
    using namespace std::chrono;
    auto start = high_resolution_clock::now();
    auto result = f();
    auto stop = high_resolution_clock::now();
    std::cout << "Result: " << result << "\n";
    std::cout << "Time: " << duration_cast<microseconds>(stop - start).count() << "\n";
}

int main() {
    timer(versionA);
    timer(versionB);
}
At least when I run it, there's a fairly substantial difference between the two methods:
Result: 499500000
Time: 5114
Result: 499500000
Time: 13196
In this case, creating a new vector every iteration takes more than twice as long as clearing an existing vector every iteration instead.
For what it's worth, there are probably two separate factors contributing to the speed difference:
Initial creation of the vector.
Re-allocating memory as elements are added to the vector.
When we clear() a vector, that removes the existing elements but retains the memory that's currently allocated, so in a case like this, where we use the same size on every iteration of the outer loop, the version that just resets the vector doesn't need to allocate any memory on subsequent iterations. If we add x.reserve(1000); immediately after defining the vector in versionB, the difference shrinks substantially (at least in my testing; not quite tied in speed, but pretty close).
Consider the following two programs that perform the same computations in two different ways:
// v1.c
#include <stdio.h>
#include <math.h>

int main(void) {
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float x;
    for (j = 0; j < nbr_values; j++) {
        x = 1;
        for (i = 0; i < n_iter; i++)
            x = sin(x);
    }
    printf("%f\n", x);
    return 0;
}
and
// v2.c
#include <stdio.h>
#include <math.h>

int main(void) {
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float x[nbr_values];
    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i = 0; i < n_iter; i++) {
        for (j = 0; j < nbr_values; ++j) {
            x[j] = sin(x[j]);
        }
    }
    printf("%f\n", x[0]);
    return 0;
}
When I compile them using gcc 4.7.2 with -O3 -ffast-math and run on a Sandy Bridge box, the second program is twice as fast as the first one.
Why is that?
One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.
(Question inspired by Why is my python/numpy example faster than pure C implementation?)
EDIT:
Here is the generated assembly for v1:
movl $8192, %ebp
pushq %rbx
LCFI1:
subq $8, %rsp
LCFI2:
.align 4
L2:
movl $100000, %ebx
movss LC0(%rip), %xmm0
jmp L5
.align 4
L3:
call _sinf
L5:
subl $1, %ebx
jne L3
subl $1, %ebp
.p2align 4,,2
jne L2
and for v2:
movl $100000, %r14d
.align 4
L8:
xorl %ebx, %ebx
.align 4
L9:
movss (%r12,%rbx), %xmm0
call _sinf
movss %xmm0, (%r12,%rbx)
addq $4, %rbx
cmpq $32768, %rbx
jne L9
subl $1, %r14d
jne L8
Ignore the loop structure altogether, and only think about the sequence of calls to sin. v1 does the following:
x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...
that is, each computation of sin( ) cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for the 8192 × 100000 = 819,200,000 calls to sin, the total time required is roughly 819,200,000 times the latency of a single sin evaluation.
In v2, by contrast, you do the following:
x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...
notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin on each as soon as the necessary register and ALU resources are available (without waiting for the previous computation to be completed). Thus, the time required is a function of the throughput of the sin function, not the latency, and so v2 can finish in significantly less time.
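To put rough numbers on it (the figures are hypothetical, purely to illustrate the latency/throughput distinction): if a sin evaluation had a latency of 20 cycles but a pipelined throughput of one result every 5 cycles, the dependent chain in v1 would cost about 819,200,000 × 20 cycles, while the independent calls in v2 could approach 819,200,000 × 5 cycles, a 4× gap of the same flavor as the 2× actually measured.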
I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.
In the first example, it runs 100000 loops of sin, 8192 times.
In the second example, it runs 8192 loops of sin, 100000 times.
Other than that and storing the result differently, I don't see any difference.
However, what does make a difference is that the input is being changed for each loop in the second case. So I suspect what happens is that the sin value, at certain times in the loop, gets much easier to calculate. And that can make a big difference. Calculating sin is not entirely trivial, and it's a series calculation that loops until the exit condition is hit.