Why does compiling with -Os make this function larger?

Why does compiling with -Os make this function larger? - c

Consider this function:
long foo(long x) {
return 5*x + 6;
}
When I compile it with x86-64 gcc 8.2 with -O3 (or -O2 or -O1), it compiles into this:
foo:
leaq 6(%rdi,%rdi,4), %rax # 5 bytes: 48 8d 44 bf 06
ret # 1 byte: c3
When I use -Os instead, it compiles into this:
foo:
leaq (%rdi,%rdi,4), %rax # 4 bytes: 48 8d 04 bf
addq $6, %rax # 4 bytes: 48 83 c0 06
ret # 1 byte: c3
The latter is 3 bytes longer. Isn't -Os supposed to produce the smallest possible code even if something larger would be more efficient? Why does the opposite seem to be happening here?
Godbolt: https://godbolt.org/z/jzNquk

While -Os ("optimize for size") is expected to produce code more compact compared to code produced with the -O1, -O2 and -O3 options ("optimize for speed"), there's indeed no such guarantee, as commented by #Robert Harvey.
Optimizing compilation is a very complex and delicate process.
It consists of dozens of different optimization phases, which are usually executed serially: each optimization phase does its work on the program tree representation, and prepares the ground for the next phase. During the optimization process, every decision made in one phase may be impactful on the optimizations down the road, and passes may interact in non trivial ways, which could be very hard to predict. The compiler employs different heuristics for producing the most optimal code, but in some cases, these heuristics fall short, as in this case.
In this example, it seems things start off as expected - with -Os producing the more compact intermediate code, but this changes later on. One of the first phases to be executed by GCC is the Expand phase, which translates the GCC high level tree representation called GIMPLE, into the lower level RTL representation. It produces RTL code similar to this:
O3:
tmp1 <- x
tmp2 <- tmp1 << 2
tmp3 <- tmp2 + x
retval <- tmp3 + 6
Os:
tmp <- x * 5
tmp2 <- tmp + 6
retval <- tmp2
So far, so good - -Os wins. But afterwards, some 15 phases later, the Combine phase is executed, which attempts to combine a sequence of instructions into one instruction. For the -O3 code, Combine is able to collapse it very cleverly to the leaq instruction in the final output, but for -Os, Combine doesn't do as much good, and not able to collapse the code further. From that point, the code doesn't change much by farther optimizations.
To answer the exact question - why does GCC do this (generate the code it does during Expand with -O3, and why Combine doesn't do a better job in -Os), one has to examine the GCC code and figure out which GCC parameters are the influential ones, as well as the decisions made by the preceding optimization phases.
But, the thing is, that while GCC under performed in this example, it may be the best choice for the majority of other examples. It's a matter of delicate trade offs - not an easy job for compiler writers!
This may not answer the question fully, but hopefully it gives some useful background. If you're interested in inspecting the outputs from GCC on each optimization phase, you could add the -da compilation flag, which will produce annotated tree dumps for every phase, and the -dP flag, which adds tree annotation to the generated assembly output, along with -S.

Related

Does the order of cases in a switch statement affect performance?

I have a switch case program:
Ascending order switch cases :
int main()
{
int a, sc = 1;
switch (sc)
{
case 1:
a = 1;
break;
case 2:
a = 2;
break;
}
}
Assembly of code:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
cmp eax, 1
je .L3
cmp eax, 2
je .L4
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 1
jmp .L2
.L4:
mov DWORD PTR [rbp-8], 2
nop
.L2:
mov eax, 0
pop rbp
ret
Descending order switch cases:
int main()
{
int a, sc = 1;
switch (sc)
{
case 2:
a = 1;
break;
case 1:
a = 2;
break;
}
}
Assembly of code:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
cmp eax, 1
je .L3
cmp eax, 2
jne .L2
mov DWORD PTR [rbp-8], 1
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 2
nop
.L2:
mov eax, 0
pop rbp
ret
Here, ascending order cases generated more assembly than descending order.
So, if I have more numbers of switch cases, then does the order of cases affect performance?

You're looking at unoptimized code, so studying it for performance isn't very meaningful. If you look at optimized code for your examples, you'll find that it doesn't do the comparisons at all! The optimizer notices that the switch variable sc always has the value 1, so it removes the unreachable case 2.
The optimizer also sees that the variable a isn't used after it's assigned, so it removes the code in case 1 as well, leaving main() an empty function. And it removes the function prolog/epilog that manipulates rbp since that register is unused.
So the optimized code ends up the same for either version of your main() function:
main:
xor eax, eax
ret
In short, for the code in the question, it doesn't matter which order you put the case statements, because none of that code will be generated at all.
Would the case order matter in a more real-life example where the code actually is generated and used? Probably not. Note that even in your unoptimized generated code, both versions test for the two case values in numeric order, checking first for 1 and then for 2, regardless of the order in the source code. Clearly the compiler is doing some sorting even in the unoptimized code.
Be sure to note Glenn and Lundin's comments: the order of the case sections is not the only change between your two examples, the actual code is different too. In one of them, the case values match the values set into a, but not so in the other.
Compilers use various strategies for switch/case statements depending on the actual values used. They may use a series of comparisons as in these examples, or perhaps a jump table. It can be interesting to study the generated code, but as always, if performance matters, watch your optimization settings and test it in a real-life situation.

Compiler optimization of switch statements is tricky. Of course, you need to enable optimizations (e.g. try to compile your code with gcc -O2 -fverbose-asm -S with GCC and look inside the generated .s assembler file). BTW on both of your examples my GCC 7 on Debian/Sid/x86-64 gives simply:
.type main, #function
main:
.LFB0:
.cfi_startproc
# rsp.c:13: }
xorl %eax, %eax #
ret
.cfi_endproc
(so there in no trace of switch in that generated code)
If you need to understand how a compiler could optimize switch, there are some papers on that subject, such as this one.
If I have more numbers of switch cases, then an order of cases effect on performance?
Not in general, if you are using some optimizing compiler and asking it to optimize. See also this.
If that matters to you so much (but it should not, leave micro-optimizations to your compiler!), you need to benchmark, to profile and perhaps to study the generated assembler code. BTW, cache misses and register allocation could matter much more than order of case-s so I think you should not bother at all. Keep in mind the approximate timing estimates of recent computers. Put the cases in the most readable order (for the next developer working on that same source code). Read also about threaded code. If you have objective (performance related) reasons to re-order the case-s (which is very unlikely and should happen at most once in your lifetime), write some good comment explaining those reasons.
If you care that much about performance, be sure to benchmark and profile, and choose a good compiler and use it with relevant optimization options. Perhaps experiment several different optimization settings (and maybe several compilers). You may want to add -march=native (in addition of -O2 or -O3). You could consider compiling and linking with -flto -O2 to enable link-time optimizations, etc. You might also want profile based optimizations.
BTW, many compilers are huge free software projects (in particular GCC and Clang). If you care that much about performance, you might patch the compiler, extend it by adding some additional optimization pass (by forking the source code, by adding some plugin to GCC or some GCC MELT extensions). That requires months or years of work (notably to understand the internal representations and organization of that compiler).
(Don't forget to take development costs into account; in most cases, they cost much more)

Performance would depend mostly of the number of branch misses for a given dataset, not so much on the total number of cases. And that in turn depends highly on the actual data and how the compiler chose to implement the switch (dispatch table, chained conditionals, tree of conditionals -- not sure if you can even control this from C).

In cases where most case labels are consecutive, compilers will often process switch statements to use jump tables rather than comparisons. The exact means by which compilers decide what form of computed jump to use (if any) will vary among different implementations. Sometimes adding extra cases to a switch statement may improve performance by simplifying a compiler's generated code (e.g. if code uses cases 4-11, while cases 0-3 are handled in default fashion, adding explicit case 0:; case 1:; case 2:; case 3:; prior to the default: may result in the compiler comparing the operand against 12 and, if it's less, using a 12-item jump table. Omitting those cases might cause the compiler to subtract 4 before comparing the difference against 8, and then use an 8-item table.
One difficulty in trying to optimize switch statements is that compilers generally know better than programmers how the performance of different approaches would vary when given certain inputs, but programmers may know better than compilers what distribution of inputs a program would receive. Given something like:
if (x==0)
y++;
else switch(x)
{
...
}
a "smart" compiler might recognize that changing the code to:
switch(x)
{
case 0:
y++;
break;
...
}
could eliminate a comparison in all cases where x is non-zero, at the cost
of a computed jump when x is zero. If x is non-zero most of the time,
that would be a good trade. If x is zero 99.9% of the time, however, that
might be a bad trade. Different compiler writers differ as to the extent to
which they will try to optimize constructs like former into the latter.

The switch statement is usually compiled via jump tables not by simple comparaisons.
So, there is no loss in performance if you permute the case-statements.
However, sometimes it is useful to keep more cases in consecutive order and not to use break/return in some entries, in order for the flow of execution to go to the next case and avoid duplicating the code.
When the differences of numbers between case number are big from one case to the other, such in case 10: and case 200000: the compiler will surely not generate jump tables, as it should fill about 200K entries almost all with a pointer toward the default: case, and in this case it will use comparaisons.

Your question is very simple - your code isn't the same, so it won't produce the same assembly! Optimised code doesn't just depend on the individual statements, but also on everything around it. And in this case, it's easy to explain the optimisation.
In your first example, case 1 results in a=1, and case 2 results in a=2. The compiler can optimise this to set a=sc for those two cases, which is a single statement.
In your second example, case 1 results in a=2, and case 2 results in a=1. The compiler can no longer take that shortcut, so it has to explicitly set a=1 or a=2 for both cases. Of course this needs more code.
If you simply took your first example and swapped the order of the cases and conditional code then you should get the same assembler.
You can test this optimisation by using the code
int main()
{
int a, sc = 1;
switch (sc)
{
case 1:
case 2:
a = sc;
break;
}
}
which should also give exactly the same assembler.
Incidentally, your test code assumes that sc is actually read. Most modern optimising compilers are able to spot that sc does not change between assignment and the switch statement, and replace reading sc with a constant value 1. Further optimisation will then remove the redundant branch(es) of the switch statement, and then even the assignment could be optimised away because a does not actually change. And from the point of view of variable a, the compiler may also discover that a is not read elsewhere and so remove that variable from the code completely.
If you really want sc to be read and a to be set, you need to declare them both volatile. Fortunately the compiler seems to have implemented it the way you expected - but you absolutely cannot expect this when you have optimisation turned on.

You should probably enable the optimisations for your compiler before comparing assembly code, however the problem is that your variable is known at compile time so the compiler can remove everything from your function because it doesn't have any side effects.
This example shows that even if you change the order of the cases in a switch statement in your example, GCC and most other compilers will reorder them if the optimisations are enabled.
I used extern functions to make sure the values are only known at run-time but I could also have used rand for example.
Also, when you add more cases, the compiler may replace the conditional jumps by a table which contains the addresses of the functions and it will still get reordered by GCC as can be seen here.

Integer powers in C

In C code it is common to write
a = b*b;
instead of
a = pow(b, 2.0);
for double variables. I get that since pow is a generic function capable of handling non-integer exponents, one should naïvely think that the first version is faster. I wonder however whether the compiler (gcc) transforms calls to pow with integer exponents to direct multiplication as part of any of the optional optimizations.
Assuming that this optimization does not take place, what is the largest integer exponent for which it is faster to write out the multiplication manually, as in b*b* ... *b?
I know that I could make performance tests on a given machine to figure out whether I should even care, but I would like to gain some deeper understanding on what is "the right thing" to do.

What you want is -ffinite-math-only -ffast-math and possibly #include <tgmath.h> This is the same as -Ofast without mandating the -O3 optimizations.
Not only does it help these kinds of optimizations when -ffinite-math-only and -ffast-math is enabled, the type generic math also helps compensate for when you forget to append the proper suffix to a (non-double) math function.
For example:
#include <tgmath.h>
float pow4(float f){return pow(f,4.0f);}
//compiles to
pow4:
vmulss xmm0, xmm0, xmm0
vmulss xmm0, xmm0, xmm0
ret
For clang this works for powers up to 32, while gcc does this for powers up to at least 2,147,483,647 (that's as far as I checked) unless -Os is enabled (because a jmp to the pow function is technically smaller) - with -Os, it will only do a power of 2.
WARNING -ffast-math is just a convenience alias to several other optimizations, many of which break all kinds of standards. If you'd rather use only the minimal flags to get this desired behavior, then you can use -fno-math-errno -funsafe-math-optimizations -ffinite-math-only

In terms of the right thing to - consider your maintainer not just performance. I have a hunch you are looking for a general rule. If you are doing a simple and consistent square or cube of a number, I would not use pow for these. pow will most likely be making some form of a subroutine call versus performing register operations (which is why Martin pointed out architecture dependendency).

Which is faster: while(1) or while(2)?

This was an interview question asked by a senior manager.
Which is faster?
while(1) {
// Some code
}
or
while(2) {
//Some code
}
I said that both have the same execution speed, as the expression inside while should finally evaluate to true or false. In this case, both evaluate to true and there are no extra conditional instructions inside the while condition. So, both will have the same speed of execution and I prefer while (1).
But the interviewer said confidently:
"Check your basics. while(1) is faster than while(2)."
(He was not testing my confidence)
Is this true?
See also: Is "for(;;)" faster than "while (TRUE)"? If not, why do people use it?

Both loops are infinite, but we can see which one takes more instructions/resources per iteration.
Using gcc, I compiled the two following programs to assembly at varying levels of optimization:
int main(void) {
while(1) {}
return 0;
}
int main(void) {
while(2) {}
return 0;
}
Even with no optimizations (-O0), the generated assembly was identical for both programs. Therefore, there is no speed difference between the two loops.
For reference, here is the generated assembly (using gcc main.c -S -masm=intel with an optimization flag):
With -O0:
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
push rbp
.seh_pushreg rbp
mov rbp, rsp
.seh_setframe rbp, 0
sub rsp, 32
.seh_stackalloc 32
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
With -O1:
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
With -O2 and -O3 (same output):
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4,,15
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
In fact, the assembly generated for the loop is identical for every level of optimization:
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
The important bits being:
.L2:
jmp .L2
I can't read assembly very well, but this is obviously an unconditional loop. The jmp instruction unconditionally resets the program back to the .L2 label without even comparing a value against true, and of course immediately does so again until the program is somehow ended. This directly corresponds to the C/C++ code:
L2:
goto L2;
Edit:
Interestingly enough, even with no optimizations, the following loops all produced the exact same output (unconditional jmp) in assembly:
while(42) {}
while(1==1) {}
while(2==2) {}
while(4<7) {}
while(3==3 && 4==4) {}
while(8-9 < 0) {}
while(4.3 * 3e4 >= 2 << 6) {}
while(-0.1 + 02) {}
And even to my amazement:
#include<math.h>
while(sqrt(7)) {}
while(hypot(3,4)) {}
Things get a little more interesting with user-defined functions:
int x(void) {
return 1;
}
while(x()) {}
#include<math.h>
double x(void) {
return sqrt(7);
}
while(x()) {}
At -O0, these two examples actually call x and perform a comparison for each iteration.
First example (returning 1):
.L4:
call x
testl %eax, %eax
jne .L4
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
Second example (returning sqrt(7)):
.L4:
call x
xorpd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jp .L4
xorpd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jne .L4
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
However, at -O1 and above, they both produce the same assembly as the previous examples (an unconditional jmp back to the preceding label).
TL;DR
Under GCC, the different loops are compiled to identical assembly. The compiler evaluates the constant values and doesn't bother performing any actual comparison.
The moral of the story is:
There exists a layer of translation between C source code and CPU instructions, and this layer has important implications for performance.
Therefore, performance cannot be evaluated by only looking at source code.
The compiler should be smart enough to optimize such trivial cases. Programmers should not waste their time thinking about them in the vast majority of cases.

Yes, while(1) is much faster than while(2), for a human to read! If I see while(1) in an unfamiliar codebase, I immediately know what the author intended, and my eyeballs can continue to the next line.
If I see while(2), I'll probably halt in my tracks and try to figure out why the author didn't write while(1). Did the author's finger slip on the keyboard? Do the maintainers of this codebase use while(n) as an obscure commenting mechanism to make loops look different? Is it a crude workaround for a spurious warning in some broken static analysis tool? Or is this a clue that I'm reading generated code? Is it a bug resulting from an ill-advised find-and-replace-all, or a bad merge, or a cosmic ray? Maybe this line of code is supposed to do something dramatically different. Maybe it was supposed to read while(w) or while(x2). I'd better find the author in the file's history and send them a "WTF" email... and now I've broken my mental context. The while(2) might consume several minutes of my time, when while(1) would have taken a fraction of a second!
I'm exaggerating, but only a little. Code readability is really important. And that's worth mentioning in an interview!

The existing answers showing the code generated by a particular compiler for a particular target with a particular set of options do not fully answer the question -- unless the question was asked in that specific context ("Which is faster using gcc 4.7.2 for x86_64 with default options?", for example).
As far as the language definition is concerned, in the abstract machine while (1) evaluates the integer constant 1, and while (2) evaluates the integer constant 2; in both cases the result is compared for equality to zero. The language standard says absolutely nothing about the relative performance of the two constructs.
I can imagine that an extremely naive compiler might generate different machine code for the two forms, at least when compiled without requesting optimization.
On the other hand, C compilers absolutely must evaluate some constant expressions at compile time, when they appear in contexts that require a constant expression. For example, this:
int n = 4;
switch (n) {
case 2+2: break;
case 4: break;
}
requires a diagnostic; a lazy compiler does not have the option of deferring the evaluation of 2+2 until execution time. Since a compiler has to have the ability to evaluate constant expressions at compile time, there's no good reason for it not to take advantage of that capability even when it's not required.
The C standard (N1570 6.8.5p4) says that
An iteration statement causes a statement called the loop body to be
executed repeatedly until the controlling expression compares equal to
0.
So the relevant constant expressions are 1 == 0 and 2 == 0, both of which evaluate to the int value 0. (These comparison are implicit in the semantics of the while loop; they don't exist as actual C expressions.)
A perversely naive compiler could generate different code for the two constructs. For example, for the first it could generate an unconditional infinite loop (treating 1 as a special case), and for the second it could generate an explicit run-time comparison equivalent to 2 != 0. But I've never encountered a C compiler that would actually behave that way, and I seriously doubt that such a compiler exists.
Most compilers (I'm tempted to say all production-quality compilers) have options to request additional optimizations. Under such an option, it's even less likely that any compiler would generate different code for the two forms.
If your compiler generates different code for the two constructs, first check whether the differing code sequences actually have different performance. If they do, try compiling again with an optimization option (if available). If they still differ, submit a bug report to the compiler vendor. It's not (necessarily) a bug in the sense of a failure to conform to the C standard, but it's almost certainly a problem that should be corrected.
Bottom line: while (1) and while(2) almost certainly have the same performance. They have exactly the same semantics, and there's no good reason for any compiler not to generate identical code.
And though it's perfectly legal for a compiler to generate faster code for while(1) than for while(2), it's equally legal for a compiler to generate faster code for while(1) than for another occurrence of while(1) in the same program.
(There's another question implicit in the one you asked: How do you deal with an interviewer who insists on an incorrect technical point. That would probably be a good question for the Workplace site).

Wait a minute. The interviewer, did he look like this guy?
It's bad enough that the interviewer himself has failed this interview,
what if other programmers at this company have "passed" this test?
No. Evaluating the statements 1 == 0 and 2 == 0 should be equally fast. We could imagine poor compiler implementations where one might be faster than the other. But there's no good reason why one should be faster than the other.
Even if there's some obscure circumstance when the claim would be true, programmers should not be evaluated based on knowledge of obscure (and in this case, creepy) trivia. Don't worry about this interview, the best move here is to walk away.
Disclaimer: This is NOT an original Dilbert cartoon. This is merely a mashup.

Your explanation is correct. This seems to be a question that tests your self-confidence in addition to technical knowledge.
By the way, if you answered
Both pieces of code are equally fast, because both take infinite time to complete
the interviewer would say
But while (1) can do more iterations per second; can you explain why? (this is nonsense; testing your confidence again)
So by answering like you did, you saved some time which you would otherwise waste on discussing this bad question.
Here is an example code generated by the compiler on my system (MS Visual Studio 2012), with optimizations turned off:
yyy:
xor eax, eax
cmp eax, 1 (or 2, depending on your code)
je xxx
jmp yyy
xxx:
...
With optimizations turned on:
xxx:
jmp xxx
So the generated code is exactly the same, at least with an optimizing compiler.

The most likely explanation for the question is that the interviewer thinks that the processor checks the individual bits of the numbers, one by one, until it hits a non-zero value:
1 = 00000001
2 = 00000010
If the "is zero?" algorithm starts from the right side of the number and has to check each bit until it reaches a non-zero bit, the while(1) { } loop would have to check twice as many bits per iteration as the while(2) { } loop.
This requires a very wrong mental model of how computers work, but it does have its own internal logic. One way to check would be to ask if while(-1) { } or while(3) { } would be equally fast, or if while(32) { } would be even slower.

Of course I do not know the real intentions of this manager, but I propose a completely different view: When hiring a new member into a team, it is useful to know how he reacts to conflict situations.
They drove you into conflict. If this is true, they are clever and the question was good. For some industries, like banking, posting your problem to Stack Overflow could be a reason for rejection.
But of course I do not know, I just propose one option.

I think the clue is to be found in "asked by a senior manager". This person obviously stopped programming when he became a manager and then it took him/her several years to become a senior manager. Never lost interest in programming, but never wrote a line since those days. So his reference is not "any decent compiler out there" as some answers mention, but "the compiler this person worked with 20-30 years ago".
At that time, programmers spent a considerable percentage of their time trying out various methods for making their code faster and more efficient as CPU time of 'the central minicomputer' was so valueable. As did people writing compilers. I'm guessing that the one-and-only compiler his company made available at that time optimized on the basis of 'frequently encountered statements that can be optimized' and took a bit of a shortcut when encountering a while(1) and evaluated everything else, including a while(2). Having had such an experience could explain his position and his confidence in it.
The best approach to get you hired is probably one that enables the senior manager to get carried away and lecture you 2-3 minutes on "the good old days of programming" before YOU smoothly lead him towards the next interview subject. (Good timing is important here - too fast and you're interrupting the story - too slow and you are labelled as somebody with insufficient focus). Do tell him at the end of the interview that you'd be highly interested to learn more about this topic.

You should have asked him how did he reached to that conclusion. Under any decent compiler out there, the two compile to the same asm instructions. So, he should have told you the compiler as well to start off. And even so, you would have to know the compiler and platform very well to even make a theoretical educated guess. And in the end, it doesn't really matter in practice, since there are other external factors like memory fragmentation or system load that will influence the loop more than this detail.

For the sake of this question, I should that add I remember Doug Gwyn from C Committee writing that some early C compilers without the optimizer pass would generate a test in assembly for the while(1) (comparing to for(;;) which wouldn't have it).
I would answer to the interviewer by giving this historical note and then say that even if I would be very surprised any compiler did this, a compiler could have:
without optimizer pass the compiler generate a test for both while(1) and while(2)
with optimizer pass the compiler is instructed to optimize (with an unconditional jump) all while(1) because they are considered as idiomatic. This would leave the while(2) with a test and therefore makes a performance difference between the two.
I would of course add to the interviewer that not considering while(1) and while(2) the same construct is a sign of low-quality optimization as these are equivalent constructs.

Another take on such a question would be to see if you got courage to tell your manager that he/she is wrong! And how softly you can communicate it.
My first instinct would have been to generate assembly output to show the manager that any decent compiler should take care of it, and if it's not doing so, you will submit the next patch for it :)

To see so many people delve into this problem, shows exactly why this could very well be a test to see how quickly you want to micro-optimize things.
My answer would be; it doesn't matter that much, I rather focus on the business problem which we are solving. After all, that's what I'm going to be paid for.
Moreover, I would opt for while(1) {} because it is more common, and other teammates would not need to spend time to figure out why someone would go for a higher number than 1.
Now go write some code. ;-)

If you're that worried about optimisation, you should use
for (;;)
because that has no tests. (cynic mode)

It seems to me this is one of those behavioral interview questions masked as a technical question. Some companies do this - they will ask a technical question that should be fairly easy for any competent programmer to answer, but when the interviewee gives the correct answer, the interviewer will tell them they are wrong.
The company wants to see how you will react in this situation. Do you sit there quietly and don't push that your answer is correct, due to either self-doubt or fear of upsetting the interviewer? Or are you willing to challenge a person in authority who you know is wrong? They want to see if you are willing to stand up for your convictions, and if you can do it in a tactful and respectful manner.

Here's a problem: If you actually write a program and measure its speed, the speed of both loops could be different! For some reasonable comparison:
unsigned long i = 0;
while (1) { if (++i == 1000000000) break; }
unsigned long i = 0;
while (2) { if (++i == 1000000000) break; }
with some code added that prints the time, some random effect like how the loop is positioned within one or two cache lines could make a difference. One loop might by pure chance be completely within one cache line, or at the start of a cache line, or it might to straddle two cache lines. And as a result, whatever the interviewer claims is fastest might actually be fastest - by coincidence.
Worst case scenario: An optimising compiler doesn't figure out what the loop does, but figures out that the values produced when the second loop is executed are the same ones as produced by the first one. And generate full code for the first loop, but not for the second.

I used to program C and Assembly code back when this sort of nonsense might have made a difference. When it did make a difference we wrote it in Assembly.
If I were asked that question I would have repeated Donald Knuth's famous 1974 quote about premature optimization and walked if the interviewer didn't laugh and move on.

Maybe the interviewer posed such dumb question intentionally and wanted you to make 3 points:
Basic reasoning. Both loops are infinite, it's hard to talk about performance.
Knowledge about optimisation levels. He wanted to hear from you if you let the compiler do any optimisation for you, it would optimise the condition, especially if the block was not empty.
Knowledge about microprocessor architecture. Most architectures have a special CPU instruction for comparision with 0 (while not necessarily faster).

They are both equal - the same.
According to the specifications anything that is not 0 is considered true, so even without any optimization, and a good compiler will not generate any code
for while(1) or while(2). The compiler would generate a simple check for != 0.

Judging by the amount of time and effort people have spent testing, proving, and answering this very straight forward question I'd say that both were made very slow by asking the question.
And so to spend even more time on it...
while (2) is ridiculous, because,
while (1), and while (true) are historically used to make an infinite loop which expects break to be called at some stage inside the loop based upon a condition that will certainly occur.
The 1 is simply there to always evaluate to true and therefore, to say while (2) is about as silly as saying while (1 + 1 == 2) which will also evaluate to true.
And if you want to be completely silly just use: -
while (1 + 5 - 2 - (1 * 3) == 0.5 - 4 + ((9 * 2) / 4.0)) {
if (succeed())
break;
}
I think that the interviewer made a typo which did not effect the running of the code, but if he intentionally used the 2 just to be weird then sack him before he puts weird statements all through your code making it difficult to read and work with.

That depends on the compiler.
If it optimizes the code, or if it evaluates 1 and 2 to true with the same number of instructions for a particular instruction set, the execution speed will be the same.
In real cases it will always be equally fast, but it would be possible to imagine a particular compiler and a particular system when this would be evaluated differently.
I mean: this is not really a language (C) related question.

Since people looking to answer this question want the fastest loop, I would have answered that both are equally compiling into the same assembly code, as stated in the other answers. Nevertheless you can suggest to the interviewer using 'loop unrolling'; a do {} while loop instead of the while loop.
Cautious: You need to ensure that the loop would at least always run once.
The loop should have a break condition inside.
Also for that kind of loop I would personally prefer the use of do {} while(42) since any integer, except 0, would do the job.

The obvious answer is: as posted, both fragments would run an equally busy infinite loop, which makes the program infinitely slow.
Although redefining C keywords as macros would technically have undefined behavior, it is the only way I can think of to make either code fragment fast at all: you can add this line above the 2 fragments:
#define while(x) sleep(x);
it will indeed make while(1) twice as fast (or half as slow) as while(2).

The only reason I can think of why the while(2) would be any slower is:
The code optimizes the loop to
cmp eax, 2
When the subtract occurs you're essentially subtracting
a. 00000000 - 00000010 cmp eax, 2
instead of
b. 00000000 - 00000001 cmp eax, 1
cmp only sets flags and does not set a result. So on the least significant bits we know if we need to borrow or not with b. Whereas with a you have to perform two subtractions before you get a borrow.

C for loop indexing: is forward-indexing faster in new CPUs?

On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:
On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i=x-1; i>=0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i=0; i< x; i++) {...}).
My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?
(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)

You're really asking about prefetching, not about loop control logic.
In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.
Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.
Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:
NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.
and two for Core:
Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc.
Core also has an ACL prefetcher like NetBurst.
If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do ok making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for for more sophisticated array traversals.
These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.
Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.

I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:
#include <time.h>
#include <stdio.h>
int main(void)
{
int i;
int s;
clock_t start_time, end_time;
int centiseconds;
start_time = clock();
s = 1;
for (i = 0; i < 1000000000; i++)
{
s = s + i;
}
end_time = clock();
centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);
start_time = clock();
s = 1;
for (i = 999999999; i >= 0; i--)
{
s = s + i;
}
end_time = clock();
centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);
return 0;
}
Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:
Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds
(Answers varied by 1 either way in several repetitions.)
Compiled with -I9 using gcc 4.4.1 running on an "Intel(R) Atom(TM) CPU N270 # 1.60GHz" (800 MHz and presumably only one core, given the program) in 32 bit Ubuntu Linux.
Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds
(Answers varied by 1 either way in several repetitions.)
Looking at the code, the forward loop is translated to:
; Gcc 3.4.4 on Cygwin for Athlon ; Gcc 4.4.1 on Ubuntu for Atom
L5: .L2:
addl %eax, %ebx addl %eax, %ebx
incl %eax addl $1, %eax
cmpl $999999999, %eax cmpl $1000000000, %eax
jle L5 jne .L2
The backward to:
L9: .L3:
addl %eax, %ebx addl %eax, %ebx
decl %eax subl $1, $eax
jns L9 cmpl $-1, %eax
jne .L3
Which shows, if not much else, that GCC's behaviour has changed between those two versions!
Pasting the older GCC's loops into the newer GCC's asm file gives results of:
Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds
Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s is used after the loop, because otherwise the compiler would elide the computation altogether.)
[1] I can never remember the command for system info...

Yes. but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).
That optimization never applied to any other CPU architecture that I know of.
Here's why.
The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.
Back in the days when we counted instructions and instructions had known fixed cycle counts using cx as your loop counter was the way to go, and cx was optimized for counting down.
But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, and simpler instructions. (RISC baby!) The key thing we try to do these days is try to put some time between loading a register and using it because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.
Nowdays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch prediction predicts wrong.

I stumbled upon this question after observing a significant drop in performance when iterating over an array backwards vs forwards. I was afraid it would be the prefetcher, but the previous answers convinced me this was not the case. I then investigated further and found out that it looks like GCC (4.8.4) is unable to exploit the full power of SIMD operations in a backward loop.
In fact, compiling the following code (from here) with -S -O3 -mavx:
for (i = 0; i < N; ++i)
r[i] = (a[i] + b[i]) * c[i];
leads to essentially:
.L10:
addl $1, %edx
vmovupd (%rdi,%rax), %xmm1
vinsertf128 $0x1, 16(%rdi,%rax), %ymm1, %ymm1
vmovupd (%rsi,%rax), %xmm0
vinsertf128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0
vaddpd (%r9,%rax), %ymm1, %ymm1
vmulpd %ymm0, %ymm1, %ymm0
vmovupd %xmm0, (%rcx,%rax)
vextractf128 $0x1, %ymm0, 16(%rcx,%rax)
addq $32, %rax
cmpl %r8d, %edx
jb .L10
i.e. assembly code that uses the AVX extensions to perform four double operations in parallel (for example, vaddpd and vmulpd).
Conversely, the following code compiled with the same parameters:
for (i = 0; i < N; ++i)
r[N-1-i] = (a[N-1-i] + b[N-1-i]) * c[N-1-i];
produces:
.L5:
vmovsd a+79992(%rax), %xmm0
subq $8, %rax
vaddsd b+80000(%rax), %xmm0, %xmm0
vmulsd c+80000(%rax), %xmm0, %xmm0
vmovsd %xmm0, r+80000(%rax)
cmpq $-80000, %rax
jne .L5
which only performs one double operation at the time (vaddsd, vmulsd).
This fact alone may be responsible for a factor of 4 between the performance when iterating backward vs forward.
Using -ftree-vectorizer-verbose=2, it looks like the problem is storing backwards: "negative step for store". In fact, if a, b, and c are read backwards, but r is written into in the forward direction, and the code is vectorized again.

It probably doesn't make a hoot of difference speed-wise, but I often write:
for (i = n; --i >= 0; ) blah blah
which I think at one time generated cleaner assembly.
Of course, in answering this kind of question, I run the risk of affirming that this is important. It's a micro-optimization kind of question, which is closely related to premature optimization, which everybody says you shouldn't do, but nevertheless SO is awash in it.

No, we can't say that CPU implementations have changed to make forward looping faster. And that has very little to do with the CPUs themselves.
It has to do with the fact that you haven't specified which CPU you're talking about, nor which compiler.
You cannot ask a blanket question about CPU issues with the C tag and expect to get an intelligent answer simply because nothing in the C standard mandates how fast CPU s should be at various operations.
If you'd like to rephrase your question to target a specific CPU and machine language (since what machine language you get out of a C compiler depends entirely on the compiler), you may get a better answer.
In either case, it shouldn't matter. You should be relying on the fact that the people that wrote your compiler know a great deal more than you about how to eke the last inch of performance from the various CPUs.
The direction in which you should be iterating has always been dictated by what you have to do. For example, if you have to process array elements in ascending order, you use:
for (i = 0; i < 1000; i++) { process (a[i]); }
rather than:
for (i = 999; i >= 0; i--) { process (a[999-i]); }
simply because any advantage you may gain in going backwards is more than swamped by the extra calculations on i. It may well be that a naked loop (no work done in the body) may be faster in one direction than another but, if you have such a naked loop, it's not doing any real work anyway.
As an aside, it may well be that both those loops above will come down to the same machine code anyway. I've seen some of the code put out by the GCC optimizer and it made my head spin. Compiler writers are, in my opinion, a species alone when it comes to insane levels of optimization.
My advice: always program for readability first then target any specific performance problems you have ("get it working first, then get it working fast").

When optimizing loops I'd rather look into loop unrolling (as it cuts down the number of comparisons vs. the exit value, and it may be optimized for parallel processing (MMX) depending on what goes on inside the loop).

How many asm-instructions per C-instruction?

I realize that this question is impossible to answer absolutely, but I'm only after ballpark figures:
Given a reasonably sized C-program (thousands of lines of code), on average, how many ASM-instructions would be generated. In other words, what's a realistic C-to-ASM instruction ratio? Feel free to make assumptions, such as 'with current x86 architectures'.
I tried to Google about this, but I couldn't find anything.
Addendum: noticing how much confusion this question brought, I feel some need for an explanation: What I wanted to know by this answer, is to know, in practical terms, what "3GHz" means. I am fully aware of that the throughput per Herz varies tremendously depending on the architecture, your hardware, caches, bus speeds, and the position of the moon.
I am not after a precise and scientific answer, but rather an empirical answer that could be put into fathomable scales.
This isn't a trivial answer to place (as I became to notice), and this was my best effort at it. I know that the amount of resulting lines of ASM per lines of C varies depending on what you are doing. i++ is not in the same neighborhood as sqrt(23.1) - I know this. Additionally, no matter what ASM I get out of the C, the ASM is interpreted into various sets of microcode within the processor, which, again, depends on whether you are running AMD, Intel or something else, and their respective generations. I'm aware of this aswell.
The ballpark answers I've got so far are what I have been after: A project large enough averages at about 2 lines of x86 ASM per 1 line of ANSI-C. Today's processors probably would average at about one ASM command per clock cycle, once the pipelines are filled, and given a sample big enough.

There is no answer possible. statements like int a; might require zero asm lines. while statements like a = call_is_inlined(); might require 20+ asm lines.
You can see yourself by compiling a c program, and then starting objdump -Sd ./a.out . It will display asm and C code intermixed, so you can see how many asm lines are generated for one C line. Example:
test.c
int get_int(int c);
int main(void) {
int a = 1, b = 2;
return getCode(a) + b;
}
$ gcc -c -g test.c
$ objdump -Sd ./test.o
00000000 <main>:
int get_int(int c);
int main(void) { /* here, the prologue creates the frame for main */
0: 8d 4c 24 04 lea 0x4(%esp),%ecx
4: 83 e4 f0 and $0xfffffff0,%esp
7: ff 71 fc pushl -0x4(%ecx)
a: 55 push %ebp
b: 89 e5 mov %esp,%ebp
d: 51 push %ecx
e: 83 ec 14 sub $0x14,%esp
int a = 1, b = 2; /* setting up space for locals */
11: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%ebp)
18: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%ebp)
return getCode(a) + b;
1f: 8b 45 f4 mov -0xc(%ebp),%eax
22: 89 04 24 mov %eax,(%esp)
25: e8 fc ff ff ff call 26 <main+0x26>
2a: 03 45 f8 add -0x8(%ebp),%eax
} /* the epilogue runs, returning to the previous frame */
2d: 83 c4 14 add $0x14,%esp
30: 59 pop %ecx
31: 5d pop %ebp
32: 8d 61 fc lea -0x4(%ecx),%esp
35: c3 ret

I'm not sure what you mean by "C-instruction", maybe statement or line? Of course this will vary greatly due to a number of factors but after looking at a few sample programs of my own, many of them are close to the 2-1 mark (2 assembly instructions per LOC), I don't know what this means or how it might be useful.
You can figure this out yourself for any particular program and implementation combination by asking the compiler to generate only the assembly (gcc -S for example) or by using a disassembler on an already compiled executable (but you would need the source code to compare it to anyway).
Edit
Just to expand on this based on your clarification of what you are trying to accomplish (understanding how many lines of code a modern processor can execute in a second):
While a modern processor may run at 3 billion cycles per second that doesn't mean that it can execute 3 billion instructions per second. Here are some things to consider:
Many instructions take multiple cycles to execute (division or floating point operations can take dozens of cycles to execute).
Most programs spend the vast majority of their time waiting for things like memory accesses, disk accesses, etc.
Many other factors including OS overhead (scheduling, system calls, etc.) are also limiting factors.
But in general yes, processors are incredibly fast and can accomplish amazing things in a short period of time.

That varies tremendously! I woudn't believe anyone if they tried to offer a rough conversion.
Statements like i++; can translate to a single INC AX.
Statements for function calls containing many parameters can be dozens of instructions as the stack is setup for the call.
Then add in there the compiler optimization that will assemble your code in a manner different than you wrote it thus eliminating instructions.
Also some instructions run better on machine word boundaries so NOPs will be peppered throughout your code.

I don't think you can conclude anything useful whatsoever about performance of real applications from what you're trying to do here. Unless 'not precise' means 'within several orders of magnitude'.
You're just way overgeneralised, and you're dismissing caching, etc, as though it's secondary, whereas it may well be totally dominant.
If your application is large enough to have trended to some average instructions-per-loc, then it will also be large enough to have I/O or at the very least significant RAM access issues to factor in.

Depending on your environment you could use the visual studio option : /FAs
more here

I am not sure there is really a useful answer to this. For sure you will have to pick the architecture (as you suggested).
What I would do: Take a reasonable sized C program. Give gcc the "-S" option and check yourself. It will generate the assembler source code and you can calculate the ratio for that program yourself.

RISC or CISC? What's an instruction in C, anyway?
Which is to repeat the above points that you really have no idea until you get very specific about the type of code you're working with.
You might try reviewing the academic literature regarding assembly optimization and the hardware/software interference cross-talk that has happened over the last 30-40 years. That's where you're going to find some kind of real data about what you're interested in. (Although I warn you, you might wind up seeing C->PDP data instead of C->IA-32 data).

You wrote in one of the comments that you want to know what 3GHz means.
Even the frequency of the CPU does not matter. Modern PC-CPUs interleave and schedule instructions heavily, they fetch and prefetch, cache memory and instructions and often that cache is invalidated and thrown to the bin. The best interpretation of processing power can be gained by running real world performance benchmarks.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight