This was an interview question asked by a senior manager.
Which is faster?
while(1) {
// Some code
}
or
while(2) {
//Some code
}
I said that both have the same execution speed, as the expression inside while should ultimately evaluate to true or false. In this case, both evaluate to true, and there are no extra conditional instructions inside the while condition. So both will have the same speed of execution, and I prefer while (1).
But the interviewer said confidently:
"Check your basics. while(1) is faster than while(2)."
(He was not testing my confidence)
Is this true?
See also: Is "for(;;)" faster than "while (TRUE)"? If not, why do people use it?
Both loops are infinite, but we can see which one takes more instructions/resources per iteration.
Using gcc, I compiled the following two programs to assembly at varying levels of optimization:
int main(void) {
    while(1) {}
    return 0;
}

int main(void) {
    while(2) {}
    return 0;
}
Even with no optimizations (-O0), the generated assembly was identical for both programs. Therefore, there is no speed difference between the two loops.
For reference, here is the generated assembly (using gcc main.c -S -masm=intel with an optimization flag):
With -O0:
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
push rbp
.seh_pushreg rbp
mov rbp, rsp
.seh_setframe rbp, 0
sub rsp, 32
.seh_stackalloc 32
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
With -O1:
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.text
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
With -O2 and -O3 (same output):
.file "main.c"
.intel_syntax noprefix
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4,,15
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
In fact, the assembly generated for the loop is identical for every level of optimization:
.L2:
jmp .L2
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
The important bits being:
.L2:
jmp .L2
I can't read assembly very well, but this is obviously an unconditional loop. The jmp instruction unconditionally resets the program back to the .L2 label without even comparing a value against true, and of course immediately does so again until the program is somehow ended. This directly corresponds to the C/C++ code:
L2:
goto L2;
Edit:
Interestingly enough, even with no optimizations, the following loops all produced the exact same output (unconditional jmp) in assembly:
while(42) {}
while(1==1) {}
while(2==2) {}
while(4<7) {}
while(3==3 && 4==4) {}
while(8-9 < 0) {}
while(4.3 * 3e4 >= 2 << 6) {}
while(-0.1 + 02) {}
And even to my amazement:
#include<math.h>
while(sqrt(7)) {}
while(hypot(3,4)) {}
Things get a little more interesting with user-defined functions:
int x(void) {
return 1;
}
while(x()) {}
#include<math.h>
double x(void) {
return sqrt(7);
}
while(x()) {}
At -O0, these two examples actually call x and perform a comparison for each iteration.
First example (returning 1):
.L4:
call x
testl %eax, %eax
jne .L4
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
Second example (returning sqrt(7)):
.L4:
call x
xorpd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jp .L4
xorpd %xmm1, %xmm1
ucomisd %xmm1, %xmm0
jne .L4
movl $0, %eax
addq $32, %rsp
popq %rbp
ret
.seh_endproc
.ident "GCC: (tdm64-2) 4.8.1"
However, at -O1 and above, they both produce the same assembly as the previous examples (an unconditional jmp back to the preceding label).
TL;DR
Under GCC, the different loops are compiled to identical assembly. The compiler evaluates the constant values and doesn't bother performing any actual comparison.
The moral of the story is:
There exists a layer of translation between C source code and CPU instructions, and this layer has important implications for performance.
Therefore, performance cannot be evaluated by only looking at source code.
The compiler should be smart enough to optimize such trivial cases. Programmers should not waste their time thinking about them in the vast majority of cases.
Yes, while(1) is much faster than while(2), for a human to read! If I see while(1) in an unfamiliar codebase, I immediately know what the author intended, and my eyeballs can continue to the next line.
If I see while(2), I'll probably halt in my tracks and try to figure out why the author didn't write while(1). Did the author's finger slip on the keyboard? Do the maintainers of this codebase use while(n) as an obscure commenting mechanism to make loops look different? Is it a crude workaround for a spurious warning in some broken static analysis tool? Or is this a clue that I'm reading generated code? Is it a bug resulting from an ill-advised find-and-replace-all, or a bad merge, or a cosmic ray? Maybe this line of code is supposed to do something dramatically different. Maybe it was supposed to read while(w) or while(x2). I'd better find the author in the file's history and send them a "WTF" email... and now I've broken my mental context. The while(2) might consume several minutes of my time, when while(1) would have taken a fraction of a second!
I'm exaggerating, but only a little. Code readability is really important. And that's worth mentioning in an interview!
The existing answers showing the code generated by a particular compiler for a particular target with a particular set of options do not fully answer the question -- unless the question was asked in that specific context ("Which is faster using gcc 4.7.2 for x86_64 with default options?", for example).
As far as the language definition is concerned, in the abstract machine while (1) evaluates the integer constant 1, and while (2) evaluates the integer constant 2; in both cases the result is compared for equality to zero. The language standard says absolutely nothing about the relative performance of the two constructs.
I can imagine that an extremely naive compiler might generate different machine code for the two forms, at least when compiled without requesting optimization.
On the other hand, C compilers absolutely must evaluate some constant expressions at compile time, when they appear in contexts that require a constant expression. For example, this:
int n = 4;
switch (n) {
    case 2+2: break;
    case 4: break;
}
requires a diagnostic; a lazy compiler does not have the option of deferring the evaluation of 2+2 until execution time. Since a compiler has to have the ability to evaluate constant expressions at compile time, there's no good reason for it not to take advantage of that capability even when it's not required.
The C standard (N1570 6.8.5p4) says that
An iteration statement causes a statement called the loop body to be executed repeatedly until the controlling expression compares equal to 0.
So the relevant constant expressions are 1 == 0 and 2 == 0, both of which evaluate to the int value 0. (These comparisons are implicit in the semantics of the while loop; they don't exist as actual C expressions.)
A perversely naive compiler could generate different code for the two constructs. For example, for the first it could generate an unconditional infinite loop (treating 1 as a special case), and for the second it could generate an explicit run-time comparison equivalent to 2 != 0. But I've never encountered a C compiler that would actually behave that way, and I seriously doubt that such a compiler exists.
Most compilers (I'm tempted to say all production-quality compilers) have options to request additional optimizations. Under such an option, it's even less likely that any compiler would generate different code for the two forms.
If your compiler generates different code for the two constructs, first check whether the differing code sequences actually have different performance. If they do, try compiling again with an optimization option (if available). If they still differ, submit a bug report to the compiler vendor. It's not (necessarily) a bug in the sense of a failure to conform to the C standard, but it's almost certainly a problem that should be corrected.
Bottom line: while (1) and while(2) almost certainly have the same performance. They have exactly the same semantics, and there's no good reason for any compiler not to generate identical code.
And though it's perfectly legal for a compiler to generate faster code for while(1) than for while(2), it's equally legal for a compiler to generate faster code for while(1) than for another occurrence of while(1) in the same program.
(There's another question implicit in the one you asked: How do you deal with an interviewer who insists on an incorrect technical point. That would probably be a good question for the Workplace site).
Wait a minute. The interviewer, did he look like this guy?
It's bad enough that the interviewer himself has failed this interview; what if other programmers at this company have "passed" this test?
No. Evaluating the expressions 1 == 0 and 2 == 0 should be equally fast. We could imagine poor compiler implementations where one might be faster than the other. But there's no good reason why one should be faster than the other.
Even if there's some obscure circumstance when the claim would be true, programmers should not be evaluated based on knowledge of obscure (and in this case, creepy) trivia. Don't worry about this interview, the best move here is to walk away.
Disclaimer: This is NOT an original Dilbert cartoon. This is merely a mashup.
Your explanation is correct. This seems to be a question that tests your self-confidence in addition to technical knowledge.
By the way, if you answered
Both pieces of code are equally fast, because both take infinite time to complete
the interviewer would say
But while (1) can do more iterations per second; can you explain why? (this is nonsense; testing your confidence again)
So by answering like you did, you saved some time which you would otherwise waste on discussing this bad question.
Here is an example code generated by the compiler on my system (MS Visual Studio 2012), with optimizations turned off:
yyy:
xor eax, eax
cmp eax, 1 (or 2, depending on your code)
je xxx
jmp yyy
xxx:
...
With optimizations turned on:
xxx:
jmp xxx
So the generated code is exactly the same, at least with an optimizing compiler.
The most likely explanation for the question is that the interviewer thinks that the processor checks the individual bits of the numbers, one by one, until it hits a non-zero value:
1 = 00000001
2 = 00000010
If the "is zero?" algorithm starts from the right side of the number and has to check each bit until it reaches a non-zero bit, the while(1) { } loop would have to check twice as many bits per iteration as the while(2) { } loop.
This requires a very wrong mental model of how computers work, but it does have its own internal logic. One way to check would be to ask if while(-1) { } or while(3) { } would be equally fast, or if while(32) { } would be even slower.
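To make that mental model concrete, here is a minimal sketch (my own illustration; no real CPU works this way) of a bit-by-bit "is this nonzero?" test. Under this model the scan stops after one bit for 1 and after two bits for 2, which is the only way the interviewer's claim could make sense:
int is_nonzero_bit_by_bit(unsigned v) {
    /* Scan from the least significant bit until a set bit is found. */
    for (unsigned bit = 0; bit < sizeof v * 8; ++bit) {
        if (v & (1u << bit))
            return 1;   /* v == 1 stops after one check, v == 2 after two */
    }
    return 0;
}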
Of course I do not know the real intentions of this manager, but I propose a completely different view: When hiring a new member into a team, it is useful to know how he reacts to conflict situations.
They drove you into conflict. If that was intentional, they are clever and the question was good. For some industries, like banking, posting your problem to Stack Overflow could be a reason for rejection.
But of course I do not know, I just propose one option.
I think the clue is to be found in "asked by a senior manager". This person obviously stopped programming when he became a manager and then it took him/her several years to become a senior manager. Never lost interest in programming, but never wrote a line since those days. So his reference is not "any decent compiler out there" as some answers mention, but "the compiler this person worked with 20-30 years ago".
At that time, programmers spent a considerable percentage of their time trying out various methods for making their code faster and more efficient, as CPU time on 'the central minicomputer' was so valuable. As did people writing compilers. I'm guessing that the one-and-only compiler his company made available at that time optimized on the basis of 'frequently encountered statements that can be optimized' and took a bit of a shortcut when encountering a while(1), while evaluating everything else, including a while(2). Having had such an experience could explain his position and his confidence in it.
The best approach to get you hired is probably one that enables the senior manager to get carried away and lecture you 2-3 minutes on "the good old days of programming" before YOU smoothly lead him towards the next interview subject. (Good timing is important here - too fast and you're interrupting the story - too slow and you are labelled as somebody with insufficient focus). Do tell him at the end of the interview that you'd be highly interested to learn more about this topic.
You should have asked him how he reached that conclusion. Under any decent compiler out there, the two compile to the same asm instructions. So he should have told you the compiler as well to start off. And even so, you would have to know the compiler and platform very well to even make a theoretical educated guess. And in the end, it doesn't really matter in practice, since there are other external factors like memory fragmentation or system load that will influence the loop more than this detail.
For the sake of this question, I should add that I remember Doug Gwyn from the C Committee writing that some early C compilers without the optimizer pass would generate a test in assembly for while(1) (compared to for(;;), which wouldn't have one).
I would answer to the interviewer by giving this historical note and then say that even if I would be very surprised any compiler did this, a compiler could have:
without the optimizer pass, the compiler generates a test for both while(1) and while(2)
with the optimizer pass, the compiler is instructed to optimize (with an unconditional jump) all occurrences of while(1) because they are considered idiomatic. That would leave while(2) with a test and therefore make a performance difference between the two.
I would of course add to the interviewer that not considering while(1) and while(2) the same construct is a sign of low-quality optimization as these are equivalent constructs.
Another take on such a question would be to see whether you have the courage to tell your manager that he/she is wrong! And how tactfully you can communicate it.
My first instinct would have been to generate assembly output to show the manager that any decent compiler should take care of it, and if it's not doing so, you will submit the next patch for it :)
Seeing so many people delve into this problem shows exactly why it could very well be a test of how quickly you want to micro-optimize things.
My answer would be: it doesn't matter that much; I'd rather focus on the business problem we are solving. After all, that's what I'm going to be paid for.
Moreover, I would opt for while(1) {} because it is more common, and other teammates would not need to spend time to figure out why someone would go for a higher number than 1.
Now go write some code. ;-)
If you're that worried about optimisation, you should use
for (;;)
because that has no tests. (cynic mode)
It seems to me this is one of those behavioral interview questions masked as a technical question. Some companies do this - they will ask a technical question that should be fairly easy for any competent programmer to answer, but when the interviewee gives the correct answer, the interviewer will tell them they are wrong.
The company wants to see how you will react in this situation. Do you sit there quietly and don't push that your answer is correct, due to either self-doubt or fear of upsetting the interviewer? Or are you willing to challenge a person in authority who you know is wrong? They want to see if you are willing to stand up for your convictions, and if you can do it in a tactful and respectful manner.
Here's a problem: If you actually write a program and measure its speed, the speed of both loops could be different! For some reasonable comparison:
unsigned long i = 0;
while (1) { if (++i == 1000000000) break; }
unsigned long i = 0;
while (2) { if (++i == 1000000000) break; }
with some code added that prints the time, some random effect like how the loop is positioned within one or two cache lines could make a difference. One loop might by pure chance be completely within one cache line, or at the start of a cache line, or it might straddle two cache lines. And as a result, whatever the interviewer claims is fastest might actually be fastest - by coincidence.
Worst case scenario: an optimising compiler doesn't figure out what the loop does, but does figure out that the values produced when the second loop is executed are the same ones as produced by the first one, and generates full code for the first loop but not for the second.
I used to program C and Assembly code back when this sort of nonsense might have made a difference. When it did make a difference we wrote it in Assembly.
If I were asked that question I would have repeated Donald Knuth's famous 1974 quote about premature optimization, and walked out if the interviewer didn't laugh and move on.
Maybe the interviewer posed such a dumb question intentionally and wanted you to make 3 points:
Basic reasoning. Both loops are infinite, so it's hard to talk about their performance.
Knowledge about optimisation levels. He wanted to hear from you that if you let the compiler do any optimisation for you, it would optimise the condition away, especially if the block was not empty.
Knowledge about microprocessor architecture. Most architectures have a special CPU instruction for comparison with 0 (though not necessarily a faster one).
They are both equal - the same.
According to the specification, anything that is not 0 is considered true, so even without any optimization a good compiler will not generate any comparison code for while(1) or while(2). If it generates a check at all, it would be the same simple check for != 0 in both cases.
Judging by the amount of time and effort people have spent testing, proving, and answering this very straightforward question, I'd say that both were made very slow by asking the question.
And so to spend even more time on it...
while (2) is ridiculous, because
while (1) and while (true) are historically used to make an infinite loop which expects break to be called at some stage inside the loop, based upon a condition that will certainly occur.
The 1 is simply there to always evaluate to true; therefore, saying while (2) is about as silly as saying while (1 + 1 == 2), which will also always evaluate to true.
And if you want to be completely silly, just use:
while (1 + 5 - 2 - (1 * 3) == 0.5 - 4 + ((9 * 2) / 4.0)) {
if (succeed())
break;
}
I think that the interviewer made a typo which did not affect the running of the code, but if he intentionally used the 2 just to be weird, then sack him before he puts weird statements all through your code, making it difficult to read and work with.
That depends on the compiler.
If it optimizes the code, or if it evaluates 1 and 2 to true with the same number of instructions for a particular instruction set, the execution speed will be the same.
In real cases it will always be equally fast, but it would be possible to imagine a particular compiler and a particular system when this would be evaluated differently.
I mean: this is not really a language (C) related question.
Since people looking to answer this question want the fastest loop, I would have answered that both compile into the same assembly code, as stated in the other answers. Nevertheless, you could suggest to the interviewer using 'loop inversion': a do {} while loop instead of the while loop.
Caution: you need to ensure that the loop will always run at least once.
The loop should have a break condition inside.
Also, for that kind of loop I would personally prefer the use of do {} while(42), since any integer except 0 would do the job.
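A minimal sketch of that suggestion, reusing the counter idea from the timing fragments earlier in this thread (the exact bound is arbitrary):
unsigned long i = 0;
do {
    if (++i == 1000000000) break;   /* the break condition lives inside the loop */
} while (42);                       /* any nonzero constant keeps the loop going */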
The obvious answer is: as posted, both fragments would run an equally busy infinite loop, which makes the program infinitely slow.
Although redefining C keywords as macros would technically have undefined behavior, it is the only way I can think of to make either code fragment fast at all: you can add this line above the 2 fragments:
#define while(x) sleep(x);
it will indeed make while(1) twice as fast (or half as slow) as while(2).
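For what it's worth, here is a complete (and deliberately silly) sketch of that joke, assuming a POSIX system where sleep() is available; the macro expansion turns each "loop" into a single sleep followed by an empty block:
#include <unistd.h>              /* sleep() -- assumes a POSIX system */

#define while(x) sleep(x);       /* technically undefined behavior: 'while' is a keyword */

int main(void) {
    while(1) {}                  /* expands to: sleep(1); {}  -- "finishes" in ~1 second  */
    while(2) {}                  /* expands to: sleep(2); {}  -- "finishes" in ~2 seconds */
    return 0;
}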
The only reason I can think of why while(2) would be any slower is:
The code optimizes the loop condition to
cmp eax, 2
When the subtraction occurs you're essentially subtracting
a. 00000000 - 00000010 (cmp eax, 2)
instead of
b. 00000000 - 00000001 (cmp eax, 1)
cmp only sets flags and does not store a result. So on the least significant bit we know whether we need to borrow or not with b, whereas with a you have to perform two subtractions before you get a borrow.
I have a switch case program:
Ascending order switch cases:
int main()
{
    int a, sc = 1;
    switch (sc)
    {
    case 1:
        a = 1;
        break;
    case 2:
        a = 2;
        break;
    }
}
Assembly of code:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
cmp eax, 1
je .L3
cmp eax, 2
je .L4
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 1
jmp .L2
.L4:
mov DWORD PTR [rbp-8], 2
nop
.L2:
mov eax, 0
pop rbp
ret
Descending order switch cases:
int main()
{
    int a, sc = 1;
    switch (sc)
    {
    case 2:
        a = 1;
        break;
    case 1:
        a = 2;
        break;
    }
}
Assembly of code:
main:
push rbp
mov rbp, rsp
mov DWORD PTR [rbp-4], 1
mov eax, DWORD PTR [rbp-4]
cmp eax, 1
je .L3
cmp eax, 2
jne .L2
mov DWORD PTR [rbp-8], 1
jmp .L2
.L3:
mov DWORD PTR [rbp-8], 2
nop
.L2:
mov eax, 0
pop rbp
ret
Here, ascending order cases generated more assembly than descending order.
So, if I have a larger number of switch cases, does the order of the cases affect performance?
You're looking at unoptimized code, so studying it for performance isn't very meaningful. If you look at optimized code for your examples, you'll find that it doesn't do the comparisons at all! The optimizer notices that the switch variable sc always has the value 1, so it removes the unreachable case 2.
The optimizer also sees that the variable a isn't used after it's assigned, so it removes the code in case 1 as well, leaving main() an empty function. And it removes the function prolog/epilog that manipulates rbp since that register is unused.
So the optimized code ends up the same for either version of your main() function:
main:
xor eax, eax
ret
In short, for the code in the question, it doesn't matter which order you put the case statements, because none of that code will be generated at all.
Would the case order matter in a more real-life example where the code actually is generated and used? Probably not. Note that even in your unoptimized generated code, both versions test for the two case values in numeric order, checking first for 1 and then for 2, regardless of the order in the source code. Clearly the compiler is doing some sorting even in the unoptimized code.
Be sure to note Glenn and Lundin's comments: the order of the case sections is not the only change between your two examples, the actual code is different too. In one of them, the case values match the values set into a, but not so in the other.
Compilers use various strategies for switch/case statements depending on the actual values used. They may use a series of comparisons as in these examples, or perhaps a jump table. It can be interesting to study the generated code, but as always, if performance matters, watch your optimization settings and test it in a real-life situation.
Compiler optimization of switch statements is tricky. Of course, you need to enable optimizations (e.g. try to compile your code with gcc -O2 -fverbose-asm -S with GCC and look inside the generated .s assembler file). BTW on both of your examples my GCC 7 on Debian/Sid/x86-64 gives simply:
.type main, #function
main:
.LFB0:
.cfi_startproc
# rsp.c:13: }
xorl %eax, %eax #
ret
.cfi_endproc
(so there is no trace of switch in that generated code)
If you need to understand how a compiler could optimize switch, there are some papers on that subject, such as this one.
If I have more switch cases, does the order of the cases affect performance?
Not in general, if you are using some optimizing compiler and asking it to optimize. See also this.
If that matters to you so much (but it should not, leave micro-optimizations to your compiler!), you need to benchmark, to profile and perhaps to study the generated assembler code. BTW, cache misses and register allocation could matter much more than order of case-s so I think you should not bother at all. Keep in mind the approximate timing estimates of recent computers. Put the cases in the most readable order (for the next developer working on that same source code). Read also about threaded code. If you have objective (performance related) reasons to re-order the case-s (which is very unlikely and should happen at most once in your lifetime), write some good comment explaining those reasons.
If you care that much about performance, be sure to benchmark and profile, and choose a good compiler and use it with relevant optimization options. Perhaps experiment with several different optimization settings (and maybe several compilers). You may want to add -march=native (in addition to -O2 or -O3). You could consider compiling and linking with -flto -O2 to enable link-time optimizations, etc. You might also want profile-based optimizations.
BTW, many compilers are huge free software projects (in particular GCC and Clang). If you care that much about performance, you might patch the compiler, extend it by adding some additional optimization pass (by forking the source code, by adding some plugin to GCC or some GCC MELT extensions). That requires months or years of work (notably to understand the internal representations and organization of that compiler).
(Don't forget to take development costs into account; in most cases, they cost much more)
Performance would depend mostly of the number of branch misses for a given dataset, not so much on the total number of cases. And that in turn depends highly on the actual data and how the compiler chose to implement the switch (dispatch table, chained conditionals, tree of conditionals -- not sure if you can even control this from C).
In cases where most case labels are consecutive, compilers will often process switch statements to use jump tables rather than comparisons. The exact means by which compilers decide what form of computed jump to use (if any) will vary among different implementations. Sometimes adding extra cases to a switch statement may improve performance by simplifying the compiler's generated code. For example, if code uses cases 4-11 while cases 0-3 are handled in default fashion, adding explicit case 0:; case 1:; case 2:; case 3:; labels before the default: may result in the compiler comparing the operand against 12 and, if it's less, using a 12-item jump table. Omitting those cases might cause the compiler to subtract 4 before comparing the difference against 8, and then use an 8-item table.
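As a hypothetical illustration of that idea (the function and the return values are made up), explicitly listing cases 0-3, which just fall through to the default, lets the compiler see a dense 0..11 range and potentially use a single 12-entry jump table:
int dispatch(int op) {
    switch (op) {
    case 0: case 1: case 2: case 3:   /* handled exactly like default */
    default:
        return -1;
    case 4:  return 40;
    case 5:  return 50;
    case 6:  return 60;
    case 7:  return 70;
    case 8:  return 80;
    case 9:  return 90;
    case 10: return 100;
    case 11: return 110;
    }
}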
One difficulty in trying to optimize switch statements is that compilers generally know better than programmers how the performance of different approaches would vary when given certain inputs, but programmers may know better than compilers what distribution of inputs a program would receive. Given something like:
if (x==0)
y++;
else switch(x)
{
...
}
a "smart" compiler might recognize that changing the code to:
switch(x)
{
case 0:
y++;
break;
...
}
could eliminate a comparison in all cases where x is non-zero, at the cost of a computed jump when x is zero. If x is non-zero most of the time, that would be a good trade. If x is zero 99.9% of the time, however, that might be a bad trade. Different compiler writers differ as to the extent to which they will try to optimize constructs like the former into the latter.
The switch statement is usually compiled via jump tables, not by simple comparisons.
So, there is no loss in performance if you permute the case-statements.
However, sometimes it is useful to keep more cases in consecutive order and not to use break/return in some entries, in order for the flow of execution to go to the next case and avoid duplicating the code.
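For example, a small (hypothetical) sketch of that fall-through grouping, where several consecutive cases share one body instead of duplicating it:
int classify(char ch) {
    switch (ch) {
    case ' ':
    case '\t':
    case '\n':        /* the whitespace cases fall through to one shared body */
        return 0;
    case '+':
    case '-':         /* the sign characters share another body */
        return 1;
    default:
        return 2;
    }
}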
When the numeric gaps between case values are big from one case to the other, such as between case 10: and case 200000:, the compiler will surely not generate a jump table, as it would have to fill about 200K entries, almost all of them with a pointer to the default: case; in that situation it will use comparisons.
Your question is very simple - your code isn't the same, so it won't produce the same assembly! Optimised code doesn't just depend on the individual statements, but also on everything around it. And in this case, it's easy to explain the optimisation.
In your first example, case 1 results in a=1, and case 2 results in a=2. The compiler can optimise this to set a=sc for those two cases, which is a single statement.
In your second example, case 1 results in a=2, and case 2 results in a=1. The compiler can no longer take that shortcut, so it has to explicitly set a=1 or a=2 for both cases. Of course this needs more code.
If you simply took your first example and swapped the order of the cases and conditional code then you should get the same assembler.
You can test this optimisation by using the code
int main()
{
    int a, sc = 1;
    switch (sc)
    {
    case 1:
    case 2:
        a = sc;
        break;
    }
}
which should also give exactly the same assembler.
Incidentally, your test code assumes that sc is actually read. Most modern optimising compilers are able to spot that sc does not change between assignment and the switch statement, and replace reading sc with a constant value 1. Further optimisation will then remove the redundant branch(es) of the switch statement, and then even the assignment could be optimised away because a does not actually change. And from the point of view of variable a, the compiler may also discover that a is not read elsewhere and so remove that variable from the code completely.
If you really want sc to be read and a to be set, you need to declare them both volatile. Fortunately the compiler seems to have implemented it the way you expected - but you absolutely cannot expect this when you have optimisation turned on.
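A minimal sketch of that volatile variant (assuming the goal really is to keep the read of sc and the stores to a in the generated code):
int main(void)
{
    volatile int a = 0;
    volatile int sc = 1;
    switch (sc)          /* the read of sc can no longer be folded to a constant */
    {
    case 1:
        a = 1;
        break;
    case 2:
        a = 2;
        break;
    }
    return 0;
}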
You should probably enable the optimisations for your compiler before comparing assembly code. However, the problem is that your variable is known at compile time, so the compiler can remove everything from your function because it doesn't have any side effects.
This example shows that even if you change the order of the cases in a switch statement in your example, GCC and most other compilers will reorder them if the optimisations are enabled.
I used extern functions to make sure the values are only known at run-time but I could also have used rand for example.
Also, when you add more cases, the compiler may replace the conditional jumps by a table which contains the addresses of the functions and it will still get reordered by GCC as can be seen here.
mov ecx, 16
looptop: .
.
.
loop looptop
How many times will this loop execute?
What happens if ecx = 0 to start with? Does loop jump or fall-through in that case?
loop is exactly like dec ecx / jnz, except it doesn't set flags.
It's like the bottom of a do {} while(--ecx != 0); in C. If execution enters the loop with ecx = 0, wrap-around means the loop will run 2^32 times. (Or 2^64 times in 64-bit mode, because it uses RCX.)
Unlike rep movsb/stosb/etc., it doesn't check for ECX=0 before decrementing, only after (see footnote 1).
The address-size determines whether it uses CX, ECX, or RCX. So in 64-bit code, addr32 loop is like dec ecx / jnz, while a regular loop is like dec rcx / jnz. Or in 16-bit code, it normally uses CX, but an address-size prefix (0x67) will make it use ecx. As Intel's manual says, it ignores REX.W, because that sets the operand-size, not the address-size.
rep string instructions use the address-size prefix the same way, overriding the address size but also RCX vs. ECX (or CX vs. ECX in modes other than 64-bit). The operand-size for string instructions is already used to determine movsw vs. movsd vs. movsq, and you want address/repeat size to be orthogonal to that. Having loop and jrcxz/jecxz follow that behaviour is just continuing the design intent from 8086 of loop being intended for use with string operations when a simple rep couldn't get the job done; see below.
Related: Why are loops always compiled into "do...while" style (tail jump)? for more about loop structure in asm, while() {} vs. do {} while() and how to lay them out.
Footnote 1: jcxz (or x86-64 jrcxz) was intended for use before the top of a do {} while style loop, to skip it if it should run 0 times. On modern CPUs test rcx, rcx / jz is more efficient.
Stephen Morse, architect of 8086, wrote about the intended uses of loop/jcxz with string instructions in that section of his book, The 8086 Primer, available for free on his web site: https://www.stevemorse.org/8086/index.html. See the "complex string instructions" subsection, starting at the bottom of page 71. (Or start reading from earlier in the chapter, the whole String Instructions section starts on page 66. But note #ecm's review of a few things the book seems to explain poorly or incorrectly.)
If you're wondering about the design intent of x86 instructions, you won't find a better source than this. That's separate from the best / most efficient way to use them, especially on modern x86, but very good intro for beginners into what you can do with asm instructions as building blocks.
Extra debugging tips
If you ever want to know the details on an instruction, check the manual: either Intel's official vol.2 PDF instruction set reference manual, or an html extract with each entry on a different page (http://felixcloutier.com/x86/). But note that the HTML leaves out the intro and appendices that have details on how to interpret stuff, like when it says "flags are set according to the result" for instructions like add.
And you can (and should) also just try stuff in a debugger: single-step and watch registers change. Use a smaller starting value for ecx so you get to the interesting ecx=1 part sooner. See also the x86 tag wiki for links to manuals, guides, and asm debugging tips at the bottom.
And BTW, if the instructions inside the loop that aren't shown modify ecx, it could loop any number of times. For the question to have a simple and unique answer, you need a guarantee that the instructions between the label and the loop instruction don't modify ecx. (They could save/restore it, but if you're going to do that it's usually better to just use a different register as the loop counter. push/pop inside a loop makes your code hard to read.)
Rant about over-use of LOOP even when you already need to increment something else in the loop. LOOP isn't the only way to loop, and usually it's the worst.
You should normally never use the loop instruction unless optimizing for code-size at the expense of speed, because it's slow. Compilers don't use it. (So CPU vendors don't bother to make it fast; catch 22.) Use dec / jnz, or an entirely different loop condition. (See also http://agner.org/optimize/ to learn more about what's efficient.)
Loops don't even have to use a counter; it's often just as good if not better to compare a pointer to an end address, or to check for some other condition. (Pointless use of loop is one of my pet peeves, especially when you already have something in another register that would work as a loop counter.) Using cx as a loop counter often just ties up one of your precious few registers when you could have used cmp/jcc on another register you were incrementing anyway.
IMO, loop should be considered one of those obscure x86 instructions that beginners shouldn't be distracted with. Like stosd (without a rep prefix), aam or xlatb. It does have real uses when optimizing for code size, though. (That's sometimes useful in real life for machine code (like for boot sectors), not just for stuff like code golf.)
IMO, just teach / learn how conditional branches work, and how to make loops out of them. Then you won't get stuck into thinking there's something special about a loop that uses loop. I've seen an SO question or comment that said something like "I thought you had to declare loops", and didn't realize that loop was just an instruction.
</rant>. Like I said, loop is one of my pet peeves. It's an obscure code-golfing instruction, unless you're optimizing for an actual 8086.
How much faster is the following assembler code:
shl ax, 1
Versus the following C code:
num = num * 2;
How can I even find out?
Your assembly variant might be faster, might be slower. What made you think that it is necessarily faster?
On the x86 platform, there are quite a few ways to multiply something by 2. I would expect a compiler to do add ax, ax, which is intuitively more efficient than your shl because it doesn't involve a potentially stored constant ('1' in your case).
Also, for quite a long time, on a x86 platform the preferred way of multiplying things by constants was not a shift, but rather a lea operation (when possible). In the above example that would be lea eax, [eax*2]. (Multiplication by 3 would be done through lea eax, [eax*2+eax])
The belief in shift operations being somehow "faster" is a nice old story for newbies, which has virtually no relevance today. And, as usual, most of the time your compiler (if it is up-to-date) has much better knowledge about the underlying hardware platform than people with naive love for shift operations.
Is this, by any chance, an academic question? I assume you understand it is in the general category of "getting a haircut to lose weight".
If you are using GCC, ask to see the generated assembly with option -S. You may find it's the same as your assembler instruction.
To answer the original question, on Out-Of-Order processors instruction speed is measured by throughput and latency, and you would measure both using the rdtsc assembly instruction. But someone else did it for you for a lot of processors, so you don't need to bother. PDF
In most circumstances, it won't make a difference. Multiplication is fast on nearly all modern hardware. In particular, it is usually fast enough that unless you have meticulously hand-optimized code, the pipeline will hide the entirety of the latency and you will see no speed difference at all between the two cases.
You may be able to measure a performance difference on multiplies and shifts when you execute them in isolation, but there will typically not be any difference in the context of the rest of your compiled code. (As I noted, this may not hold true if the code is meticulously optimized).
Now, that said, shifts are still generally faster than multiplies, and almost any reasonable compiler will map a fixed power-of-two multiply into a shift, anyway (assuming that the semantics are actually equivalent on the target architecture).
Edit: one more thing you may want to try if you really care about this is x+x. I know of at least one architecture on which this can actually be faster than shifting, depending on the surrounding context.
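If you want to check this yourself, here is a small sketch of the three equivalent formulations (unsigned, so their semantics really are identical); compile each with -S and compare what your compiler actually emits:
unsigned dbl_mul(unsigned num)   { return num * 2;   }  /* multiply      */
unsigned dbl_shift(unsigned num) { return num << 1;  }  /* shift left    */
unsigned dbl_add(unsigned num)   { return num + num; }  /* add to itself */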
If you have a decent compiler, it will produce the same or similar code. The best way to find out is to disassemble and check the generated code.
The answer depends, as you've been able to see here, upon many things. What the compiler will do with your C code depends on a lot of things. If we're talking x86-32 the following should be generally applicable.
At the basic level, your C code indicates a memory variable, which would require at least one instruction to multiply by two: "shl mem,1" - and in such a simple case the C code will be slower.
If num is a local variable the compiler may decide to put it in a register (if it is used often enough and/or the function is small enough) and then you will have your "shl reg,1" instruction - maybe.
What instruction is fastest has everything to do with how they are implemented in the processor. Shl may not be the best choice since it affects the C and Z flags which slows it down. A few years ago the recommendation was "lea reg,[reg+reg]" (all reg are the same) because lea didn't affect any flags and there were variants such as (using the eax register on x86-32 platform as an example):
lea eax,[eax+eax] ; *2
lea eax,[eax+eax*2] ; *3
lea eax,[eax+eax*4] ; *5
lea eax,[eax+eax*8] ; *9
I don't know what the norm is today, but your compiler probably does.
As for measuring, search for information here on the rdtsc instruction, which is the hands-down best alternative, as it counts actual clock cycles.
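A rough sketch of such a measurement, assuming GCC or Clang on x86 where the __rdtsc() intrinsic is available via <x86intrin.h>; it ignores serialization, frequency scaling and out-of-order effects, so treat the result only as a ballpark figure:
#include <stdio.h>
#include <x86intrin.h>            /* __rdtsc() -- GCC/Clang on x86 */

int main(void) {
    volatile unsigned num = 21;   /* volatile so the multiply isn't optimized away */
    unsigned long long start = __rdtsc();
    for (int i = 0; i < 1000000; i++)
        num = num * 2;            /* the operation under test */
    unsigned long long end = __rdtsc();
    printf("roughly %.2f reference cycles per iteration\n",
           (double)(end - start) / 1000000.0);
    return 0;
}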
Put them in a loop with a counter that goes so high that it runs for at least a second in the fastest case. Use your favorite timing mechanism to see how long each takes.
The assembly case should be done with inline assembly in the same C program as you use for the pure C test. Otherwise, you're not comparing apples to apples.
By the way, I think you should add a third test:
num <<= 1;
The question then is whether that does the same thing as the assembly version.
If, for your target platform, shifting left is the quickest way to multiply a number by two, then the chances are your compiler will do that when compiling the code. Look at the disassembly to check.
So, for that one line, it's probably exactly the same speed. However, as you're unlikely to have a function containing just that one line, you might well find the compiler would defer the shift until the value is used, or otherwise mix it up with surrounding code, making it less clear cut. A good optimizing compiler will generally do a good job of beating poor to average hand written assembly.
If the up-to-date compiler (VC9) were really doing a good job, it would outperform VC6 by a wide margin, and that doesn't happen. This is why I still prefer to use VC6 for some code that runs faster than code compiled in MinGW with -O3 or VC9 with /Ox.
On a mailing list I'm subscribed to, two fairly knowledgeable (IMO) programmers were discussing some optimized code, and saying something along the lines of:
On the CPUs released 5-8 years ago, it was slightly faster to iterate for loops backwards (e.g. for (int i=x-1; i>=0; i--) {...}) because comparing i to zero is more efficient than comparing it to some other number. But with very recent CPUs (e.g. from 2008-2009) the speculative loader logic is such that it works better if the for loop is iterated forward (e.g. for (int i=0; i< x; i++) {...}).
My question is, is that true? Have CPU implementations changed recently such that forward-loop-iterating now has an advantage over backward-iterating? If so, what is the explanation for that? i.e. what changed?
(Yes, I know, premature optimization is the root of all evil, review my algorithm before worrying about micro-optimizations, etc etc... mostly I'm just curious)
You're really asking about prefetching, not about loop control logic.
In general, loop performance isn't going to be dictated by the control logic (i.e. the increment/decrement and the condition that gets checked every time through). The time it takes to do these things is inconsequential except in very tight loops. If you're interested in that, take a look at John Knoeller's answer for specifics on the 8086's counter register and why it might've been true in the old days that counting down was more efficient. As John says, branch prediction (and also speculation) can play a role in performance here, as can instruction prefetching.
Iteration order can affect performance significantly when it changes the order in which your loop touches memory. The order in which you request memory addresses can affect what is drawn into your cache and also what is evicted from your cache when there is no longer room to fetch new cache lines. Having to go to memory more often than needed is much more expensive than compares, increments, or decrements. On modern CPUs it can take thousands of cycles to get from the processor to memory, and your processor may have to idle for some or all of that time.
You're probably familiar with caches, so I won't go into all those details here. What you may not know is that modern processors employ a whole slew of prefetchers to try to predict what data you're going to need next at different levels of the memory hierarchy. Once they predict, they try to pull that data from memory or lower level caches so that you have what you need when you get around to processing it. Depending on how well they grab what you need next, your performance may or may not improve when using them.
Take a look at Intel's guide to optimizing for hardware prefetchers. There are four prefetchers listed; two for NetBurst chips:
NetBurst's hardware prefetcher can detect streams of memory accesses in either forward or backward directions, and it will try to load data from those locations into the L2 cache.
NetBurst also has an adjacent cache line (ACL) prefetcher, which will automatically load two adjacent cache lines when you fetch the first one.
and two for Core:
Core has a slightly more sophisticated hardware prefetcher; it can detect strided access in addition to streams of contiguous references, so it'll do better if you step through an array every other element, every 4th, etc.
Core also has an ACL prefetcher like NetBurst.
If you're iterating through an array forward, you're going to generate a bunch of sequential, usually contiguous memory references. The ACL prefetchers are going to do much better for forward loops (because you'll end up using those subsequent cache lines) than for backward loops, but you may do ok making memory references backward if the prefetchers can detect this (as with the hardware prefetchers). The hardware prefetchers on the Core can detect strides, which is helpful for more sophisticated array traversals.
These simple heuristics can get you into trouble in some cases. For example, Intel actually recommends that you turn off adjacent cache line prefetching for servers, because they tend to make more random memory references than desktop user machines. The probability of not using an adjacent cache line is higher on a server, so fetching data you're not actually going to use ends up polluting your cache (filling it with unwanted data), and performance suffers. For more on addressing this kind of problem, take a look at this paper from Supercomputing 2009 on using machine learning to tune prefetchers in large data centers. Some guys at Google are on that paper; performance is something that is of great concern to them.
Simple heuristics aren't going to help you with more sophisticated algorithms, and you might have to start thinking about the sizes of your L1, L2, etc. caches. Image processing, for example, often requires that you perform some operation on subsections of a 2D image, but the order you traverse the image can affect how well useful pieces of it stay in your cache without being evicted. Take a look at Z-order traversals and loop tiling if you're interested in this sort of thing. It's a pretty basic example of mapping the 2D locality of image data to the 1D locality of memory to improve performance. It's also an area where compilers aren't always able to restructure your code in the best way, but manually restructuring your C code can improve cache performance drastically.
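As a rough sketch of loop tiling for such a 2D traversal (the image size, the operation and the TILE value are made up; a real block size would be tuned to your cache sizes):
#define W    1024
#define H    1024
#define TILE 64                   /* hypothetical block size */

void invert_tiled(unsigned char img[H][W]) {
    for (int ty = 0; ty < H; ty += TILE)
        for (int tx = 0; tx < W; tx += TILE)
            for (int y = ty; y < ty + TILE && y < H; y++)
                for (int x = tx; x < tx + TILE && x < W; x++)
                    img[y][x] = (unsigned char)(255 - img[y][x]);   /* example operation */
}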
I hope this gives you an idea of how iteration order affects memory performance. It does depend on the particular architecture, but the ideas are general. You should be able to understand prefetching on AMD and Power if you can understand it on Intel, and you don't really have to know assembly to structure your code to take advantage of memory. You just need to know a little computer architecture.
I don't know. But I do know how to write a quick benchmark with no guarantees of scientific validity (actually, one with rather strict guarantees of invalidity). It has interesting results:
#include <time.h>
#include <stdio.h>

int main(void)
{
    int i;
    int s;
    clock_t start_time, end_time;
    long centiseconds;

    start_time = clock();
    s = 1;
    for (i = 0; i < 1000000000; i++)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Forward took %ld centiseconds\n", s, centiseconds);

    start_time = clock();
    s = 1;
    for (i = 999999999; i >= 0; i--)
    {
        s = s + i;
    }
    end_time = clock();
    centiseconds = (end_time - start_time)*100 / CLOCKS_PER_SEC;
    printf("Answer is %d; Backward took %ld centiseconds\n", s, centiseconds);

    return 0;
}
Compiled with -O9 using gcc 3.4.4 on Cygwin, running on an "AMD Athlon(tm) 64 Processor 3500+" (2211 MHz) in 32 bit Windows XP:
Answer is -1243309311; Forward took 93 centiseconds
Answer is -1243309311; Backward took 92 centiseconds
(Answers varied by 1 either way in several repetitions.)
Compiled with -O9 using gcc 4.4.1, running on an "Intel(R) Atom(TM) CPU N270 @ 1.60GHz" (800 MHz and presumably only one core, given the program) in 32 bit Ubuntu Linux.
Answer is -1243309311; Forward took 196 centiseconds
Answer is -1243309311; Backward took 228 centiseconds
(Answers varied by 1 either way in several repetitions.)
Looking at the code, the forward loop is translated to:
; Gcc 3.4.4 on Cygwin for Athlon ; Gcc 4.4.1 on Ubuntu for Atom
L5: .L2:
addl %eax, %ebx addl %eax, %ebx
incl %eax addl $1, %eax
cmpl $999999999, %eax cmpl $1000000000, %eax
jle L5 jne .L2
The backward to:
L9: .L3:
addl %eax, %ebx addl %eax, %ebx
decl %eax subl $1, %eax
jns L9 cmpl $-1, %eax
jne .L3
Which shows, if not much else, that GCC's behaviour has changed between those two versions!
Pasting the older GCC's loops into the newer GCC's asm file gives results of:
Answer is -1243309311; Forward took 194 centiseconds
Answer is -1243309311; Backward took 133 centiseconds
Summary: on the >5 year old Athlon, the loops generated by GCC 3.4.4 are the same speed. On the newish (<1 year?) Atom, the backward loop is significantly faster. GCC 4.4.1 has a slight regression for this particular case which I personally am not bothered about in the least, given the point of it. (I had to make sure that s is used after the loop, because otherwise the compiler would elide the computation altogether.)
[1] I can never remember the command for system info...
Yes, but with a caveat. The idea that looping backwards is faster never applied to all older CPUs. It's an x86 thing (as in 8086 through 486, possibly Pentium, although I don't think any further).
That optimization never applied to any other CPU architecture that I know of.
Here's why.
The 8086 had a register that was specifically optimized for use as a loop counter. You put your loop count in CX, and then there are several instructions that decrement CX and then set condition codes if it goes to zero. In fact there was an instruction prefix you could put before other instructions (the REP prefix) that would basically iterate the other instruction until CX got to 0.
Back in the days when we counted instructions and instructions had known, fixed cycle counts, using CX as your loop counter was the way to go, and CX was optimized for counting down.
But that was a long time ago. Ever since the Pentium, those complex instructions have been slower overall than using more, and simpler instructions. (RISC baby!) The key thing we try to do these days is try to put some time between loading a register and using it because the pipelines can actually do multiple things per cycle as long as you don't try to use the same register for more than one thing at a time.
Nowadays the thing that kills performance isn't the comparison, it's the branching, and then only when the branch predictor predicts wrong.
I stumbled upon this question after observing a significant drop in performance when iterating over an array backwards vs forwards. I was afraid it would be the prefetcher, but the previous answers convinced me this was not the case. I then investigated further and found out that it looks like GCC (4.8.4) is unable to exploit the full power of SIMD operations in a backward loop.
In fact, compiling the following code (from here) with -S -O3 -mavx:
for (i = 0; i < N; ++i)
r[i] = (a[i] + b[i]) * c[i];
leads to essentially:
.L10:
addl $1, %edx
vmovupd (%rdi,%rax), %xmm1
vinsertf128 $0x1, 16(%rdi,%rax), %ymm1, %ymm1
vmovupd (%rsi,%rax), %xmm0
vinsertf128 $0x1, 16(%rsi,%rax), %ymm0, %ymm0
vaddpd (%r9,%rax), %ymm1, %ymm1
vmulpd %ymm0, %ymm1, %ymm0
vmovupd %xmm0, (%rcx,%rax)
vextractf128 $0x1, %ymm0, 16(%rcx,%rax)
addq $32, %rax
cmpl %r8d, %edx
jb .L10
i.e. assembly code that uses the AVX extensions to perform four double operations in parallel (for example, vaddpd and vmulpd).
Conversely, the following code compiled with the same parameters:
for (i = 0; i < N; ++i)
r[N-1-i] = (a[N-1-i] + b[N-1-i]) * c[N-1-i];
produces:
.L5:
vmovsd a+79992(%rax), %xmm0
subq $8, %rax
vaddsd b+80000(%rax), %xmm0, %xmm0
vmulsd c+80000(%rax), %xmm0, %xmm0
vmovsd %xmm0, r+80000(%rax)
cmpq $-80000, %rax
jne .L5
which only performs one double operation at the time (vaddsd, vmulsd).
This fact alone may be responsible for a factor of 4 between the performance when iterating backward vs forward.
Using -ftree-vectorizer-verbose=2, it looks like the problem is storing backwards: "negative step for store". In fact, if a, b, and c are read backwards but r is written in the forward direction, the code is vectorized again.
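A sketch of that rearrangement (my own wrapper function, not the original benchmark): the loads walk backwards but the store walks forwards, so the vectorizer no longer sees a negative-step store. Note that r is then filled in reversed order relative to the original loop, so the surrounding code would have to account for that:
void combine_reversed(int n, const double *a, const double *b, const double *c, double *r) {
    for (int i = 0; i < n; ++i)
        r[i] = (a[n-1-i] + b[n-1-i]) * c[n-1-i];   /* backward loads, forward store */
}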
It probably doesn't make a hoot of difference speed-wise, but I often write:
for (i = n; --i >= 0; ) blah blah
which I think at one time generated cleaner assembly.
Of course, in answering this kind of question, I run the risk of affirming that this is important. It's a micro-optimization kind of question, which is closely related to premature optimization, which everybody says you shouldn't do, but nevertheless SO is awash in it.
No, we can't say that CPU implementations have changed to make forward looping faster. And that has very little to do with the CPUs themselves.
It has to do with the fact that you haven't specified which CPU you're talking about, nor which compiler.
You cannot ask a blanket question about CPU issues with the C tag and expect to get an intelligent answer, simply because nothing in the C standard mandates how fast CPUs should be at various operations.
If you'd like to rephrase your question to target a specific CPU and machine language (since what machine language you get out of a C compiler depends entirely on the compiler), you may get a better answer.
In either case, it shouldn't matter. You should be relying on the fact that the people that wrote your compiler know a great deal more than you about how to eke the last inch of performance from the various CPUs.
The direction in which you should be iterating has always been dictated by what you have to do. For example, if you have to process array elements in ascending order, you use:
for (i = 0; i < 1000; i++) { process (a[i]); }
rather than:
for (i = 999; i >= 0; i--) { process (a[999-i]); }
simply because any advantage you may gain in going backwards is more than swamped by the extra calculations on i. It may well be that a naked loop (no work done in the body) may be faster in one direction than another but, if you have such a naked loop, it's not doing any real work anyway.
As an aside, it may well be that both those loops above will come down to the same machine code anyway. I've seen some of the code put out by the GCC optimizer and it made my head spin. Compiler writers are, in my opinion, a species alone when it comes to insane levels of optimization.
My advice: always program for readability first then target any specific performance problems you have ("get it working first, then get it working fast").
When optimizing loops I'd rather look into loop unrolling (as it cuts down the number of comparisons against the exit value, and it may be optimized for parallel processing (MMX), depending on what goes on inside the loop).
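For completeness, a small sketch of manual unrolling by four (a hypothetical example; modern compilers will usually do this, and vectorize, on their own at higher optimization levels):
void double_all(int *a, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {   /* unrolled main body: one compare per 4 elements */
        a[i]   *= 2;
        a[i+1] *= 2;
        a[i+2] *= 2;
        a[i+3] *= 2;
    }
    for (; i < n; i++)                  /* handle the leftover elements */
        a[i] *= 2;
}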