Example:
for (int i = 0; i < a[index]; i++) {
// do stuff
}
Would a[index] be read every time? If no, what if someone wanted to change the value at a[index] in the loop? I've never seen it myself, but does the compiler make such an assumption?
If the condition was instead i < val-2, would it be evaluated every time?
The compiler will normally perform this optimization only when it can prove the result is not affected by other parts of the program. So if you change the condition's operands inside the for loop, the compiler will not cache them.
As mentioned, in your code snippet the compiler must behave as if it reads the array and checks it before each iteration. You can rewrite the code as follows, so that the array is read only once for the loop-condition check:
int cond = a[index];
for (int i = 0; i < cond; i++) {
// do stuff
}
Well, maybe. A standards-compliant compiler will produce code that behaves as if a[index] were read every time.
If index and/or the array are declared volatile (a type qualifier, not a storage class), they will be re-evaluated every time.
If they are not, and the loop body doesn't use them in a way that can be expected to modify their value, the optimiser may decide to use a cached result instead.
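For illustration, a minimal sketch of the difference (the names and the interrupt scenario are mine, not from the question):

int a[16];
volatile int vindex; /* e.g. updated by an interrupt handler */

void f(int index)
{
    /* a[index] may be hoisted out: nothing in the loop changes a or index */
    for (int i = 0; i < a[index]; i++) {
        /* do stuff */
    }
    /* a[vindex] must be re-evaluated: vindex is volatile, so its value
       may not be cached across iterations */
    for (int i = 0; i < a[vindex]; i++) {
        /* do stuff */
    }
}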
C does not store results of expressions in temporary variables; all expressions are re-evaluated in place. Note that any for loop can be changed to a while loop:
for ( def_or_expr1 ; expr2 ; expr3 ) {
    ...
}
becomes:
def_or_expr1;
while ( expr2 ) {
    ...
cont:
    expr3;
}
Update: continue in the for loop would be the same as goto cont; in the while loop above. I.e. expr3 is evaluated for every iteration.
The compiler can basically apply any optimization it can prove does not change the program's observable behaviour. Describing the full details would go too far here, but in general it can (and will) optimize:
a[index] is not changed in the loop: read it once before the loop and keep it in a temporary (e.g. a register).
a[index] is changed in the loop: keep the temporary (register) updated with the new value, avoiding the memory access (and the index calculation).
For this, the compiler must assume the array is not changed outside the visible control flow. This is typically the file being compiled (with all included files). For modern systems using link time optimization (LTO), this can be the whole final program - minus dynamic libraries.
Note this is a very brief description. Actually, the C standard defines pretty clearly how a program has to be executed, and therefore what/how the compiler may optimize.
If the array is changed, for example by an interrupt handler or another thread, things become complicated. Depending on your target, you need anything from volatile, to atomic operations (stdatomic.h, since C11), up to thread locks/mutexes/semaphores/etc. to control access to the shared resource.
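For the threaded case, a minimal sketch using C11 atomics (the shared-counter scenario and names are illustrative):

#include <stdatomic.h>

atomic_int limit; /* written concurrently by another thread */

void worker(void)
{
    /* atomic_load re-reads the shared bound on every iteration, and the
       access is well-defined even if another thread writes it meanwhile */
    for (int i = 0; i < atomic_load(&limit); i++) {
        /* do stuff */
    }
}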
Is memset() more efficient than a for loop?
Considering this code:
char x[500];
memset(x,0,sizeof(x));
And this one:
char x[500];
for (int i = 0; i < 500; i++) x[i] = 0;
Which one is more efficient, and why? Is there any special instruction in hardware to do block-level initialization?
Almost certainly, memset will be much faster than that loop. Note how you treat one character at a time, while those functions are so optimized that they set several bytes at a time, even using, when available, MMX and SSE instructions.
I think the paradigmatic example of these optimizations, which usually go unnoticed, is the GNU C library's strlen function. One would think that it has at least O(n) performance, but it actually processes 4 or 8 bytes per step depending on the architecture (yes, I know, in big-O terms that is still O(n), but you actually get an eighth of the time). How? Tricky, but nicely: strlen.
Well, why don't we take a look at the generated assembly code, with full optimization, under VS 2010?
char x[500];
char y[500];
int i;
memset(x, 0, sizeof(x) );
003A1014 push 1F4h
003A1019 lea eax,[ebp-1F8h]
003A101F push 0
003A1021 push eax
003A1022 call memset (3A1844h)
And your loop...
char x[500];
char y[500];
int i;
for( i = 0; i < 500; ++i )
{
x[i] = 0;
00E81014 push 1F4h
00E81019 lea eax,[ebp-1F8h]
00E8101F push 0
00E81021 push eax
00E81022 call memset (0E81844h)
/* note that this is *replacing* the loop,
not being called once for each iteration. */
}
So, under this compiler, the generated code is exactly the same. memset is fast, and the compiler is smart enough to know that you are doing the same thing as calling memset once anyway, so it does it for you.
If the compiler actually left the loop as-is then it would likely be slower, as you can set more than one byte-sized block at a time (i.e., you could unroll your loop a bit at a minimum). You can assume that memset will be at least as fast as a naive implementation such as the loop. Try it under a debug build and you will notice that the loop is not replaced.
That said, it depends on what the compiler does for you. Looking at the disassembly is always a good way to know exactly what is going on.
It really depends on the compiler and library. For older compilers or simple compilers, memset may be implemented in a library and would not perform better than a custom loop.
For nearly all compilers that are worth using, memset is an intrinsic function and the compiler will generate optimized, inline code for it.
Others have suggested profiling and comparing, but I wouldn't bother. Just use memset. The code is simple and easy to understand. Don't worry about it until your benchmarks tell you this part of the code is a performance hotspot.
The answer is 'it depends'. memset MAY be more efficient, or it may internally use a for loop. I can't think of a case where memset will be less efficient. In this case, it may turn into a more efficient loop: your loop iterates 500 times, setting one byte's worth of the array to 0 every time. On a 64-bit machine, you could loop through setting 8 bytes (a long long) at a time, which would be almost 8 times quicker, and then deal with the remaining 4 bytes (500 % 8) at the end.
EDIT:
in fact, this is what memset does in glibc:
http://repo.or.cz/w/glibc.git/blob/HEAD:/string/memset.c
As Michael pointed out, in certain cases (where the array length is known at compile time), the C compiler can inline memset, getting rid of the overhead of the function call. Glibc also has assembly optimized versions of memset for most major platforms, like amd64:
http://repo.or.cz/w/glibc.git/blob/HEAD:/sysdeps/x86_64/memset.S
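To make the 8-bytes-at-a-time idea concrete, here is a rough sketch of my own (not the glibc code - a real implementation additionally aligns the pointer first and uses SIMD stores where available):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void zero_bytes(char *p, size_t n)
{
    const uint64_t zero = 0;
    /* set 8 bytes per iteration; memcpy sidesteps alignment and
       aliasing issues, and compilers lower it to a single store */
    while (n >= sizeof zero) {
        memcpy(p, &zero, sizeof zero);
        p += sizeof zero;
        n -= sizeof zero;
    }
    /* finish the tail byte by byte (e.g. the 500 % 8 = 4 leftovers) */
    while (n--)
        *p++ = 0;
}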
Good compilers will recognize the for loop and replace it with either an optimal inline sequence or a call to memset. They will also replace memset with an optimal inline sequence when the buffer size is small.
In practice, with an optimizing compiler the generated code (and therefore performance) will be identical.
Agreed with the above: it depends. But memset is certain to be at least as fast as the for loop. If you are uncertain of your environment or too lazy to test, take the safe route and go with memset.
Other techniques, like loop unrolling, which reduce the number of loop iterations, can also be used. The code of memset() can mimic the famous Duff's device:
void *duff_memset(char *to, int c, size_t count)
{
    size_t n;
    char *p = to;

    if (count == 0)   /* without this guard, count == 0 would make
                         n wrap around and loop (almost) forever */
        return to;

    n = (count + 7) / 8;
    switch (count % 8) {
    case 0: do { *p++ = c;
    case 7:      *p++ = c;
    case 6:      *p++ = c;
    case 5:      *p++ = c;
    case 4:      *p++ = c;
    case 3:      *p++ = c;
    case 2:      *p++ = c;
    case 1:      *p++ = c;
            } while (--n > 0);
    }
    return to;
}
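Usage is the same as for memset (given the count > 0 guard added above):

char buf[100];
duff_memset(buf, 0, sizeof buf);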
Those tricks were used to enhance execution speed in the past. But on modern architectures this tends to increase code size and the number of cache misses.
So, it is quite impossible to say which implementation is faster, as it depends on the quality of the compiler optimizations, the ability of the C library to take advantage of special hardware instructions, the amount of data you are operating on, and the features of the underlying operating system (page-fault management, TLB misses, copy-on-write).
For example, in glibc, the implementations of memset() as well as of various other "copy/set" functions like bzero() or strcpy() are architecture-dependent, to take advantage of optimized hardware instructions like SSE or AVX.
Many times I need to do things TWICE in a for loop. Simply I can set up a for loop with an iterator and go through it twice:
for (i = 0; i < 2; i++)
{
// Do stuff
}
Now I am interested in doing this as SIMPLY as I can, perhaps without an initializer or iterator? Are there any other, really simple and elegant, ways of achieving this?
This is elegant because it looks like a triangle; and triangles are elegant.
i = 0;
here: dostuff();
i++; if ( i == 1 ) goto here;
Encapsulate it in a function and call it twice.
void do_stuff() {
// Do Stuff
}
// .....
do_stuff();
do_stuff();
Note: if you use variables or parameters of the enclosing function in the stuff logic, you can pass them as arguments to the extracted do_stuff function.
If it's only twice, and you want to avoid a loop, just write the darn thing twice.
statement1;
statement1; // (again)
If the loop is too verbose for you, you can also define an alias for it:
#define TWICE for (int _index = 0; _index < 2; _index++)
This would result into that code:
TWICE {
// Do Stuff
}
// or
TWICE
func();
I would only recommend using this macro if you have to do this very often; otherwise, I think the plain for loop is more readable.
Unfortunately, this is not for C but for C++ only; still, it does exactly what you want:
Just include the header, and you can write something like this:
10 times {
// Do stuff
}
I'll try to rewrite it for C as well.
So, after some time, here's an approach that enables you to write the following in pure C:
2 times {
do_something()
}
Example:
You'll have to include this little thing as a simple header file (I always called the file extension.h). Then, you'll be able to write programs in the style of:
#include <stdio.h>
#include "extension.h"

int main(int argc, char** argv){
    3 times printf("Hello.\n");
    3 times printf("Score: 0 : %d\n", _);
    2 times {
        printf("Counting: ");
        9 times printf("%d ", _);
        printf("\n");
    }
    5 times {
        printf("Counting up to %d: ", _);
        _ times printf("%d ", _);
        printf("\n");
    }
    return 0;
}
Features:
Simple notation of simple loops (in the style depicted above)
Counter is implicitly stored in a variable called _ (a simple underscore).
Nesting of loops allowed.
Restrictions (and how to (partially) circumvent them):
Works only for a certain number of loops (which is - "of course" - reasonable, since you only would want to use such a thing for "small" loops). Current implementation supports a maximum of 18 iterations (higher values result in undefined behaviour). Can be adjusted in header file by changing the size of array _A.
Only a certain nesting depth is allowed. Current implementation supports a nesting depth of 10. Can be adjusted by redefining the macro _Y.
Explanation:
You can see the full (=de-obfuscated) source-code here. Let's say we want to allow up to 18 loops.
Retrieving upper iteration bound: The basic idea is to have an array of chars that are initially all set to 0 (this is the array counterarray). If we issue a call to e.g. 2 times {do_it;}, the macro times shall set the second element of counterarray to 1 (i.e. counterarray[2] = 1). In C, it is possible to swap index and array name in such an assignment, so we can write 2[counterarray] = 1 to achieve the same. This is exactly what the macro times does as its first step. Then, we can later scan the array counterarray until we find an element that is not 0, but 1. The corresponding index is then the upper iteration bound. It is stored in the variable searcher. Since we want to support nesting, we have to store the upper bound for each nesting depth separately; this is done by searchermax[depth]=searcher+1.
Adjusting current nesting depth: As said, we want to support nesting of loops, so we have to keep track of the current nesting depth (done in the variable depth). We increment it by one if we start such a loop.
The actual counter variable: We have a "variable" called _ that implicitly gets assigned the current counter. In fact, we store one counter for each nesting depth (all stored in the array counter). Then, _ is just another macro that retrieves the proper counter for the current nesting depth from this array.
The actual for loop: We break the for loop into parts:
We initialize the counter for the current nesting depth to 0 (done by counter[depth] = 0).
The iteration step is the most complicated part: We have to check if the loop at the current nesting depth has reached its end. If so, we have to update the nesting depth accordingly. If not, we have to increment the current nesting depth's counter by 1. The variable lastloop is 1 if this is the last iteration, otherwise 0, and we adjust the current nesting depth accordingly. The main problem here is that we have to write this as a sequence of expressions, all separated by commas, which requires us to write all these conditions in a very non-straightforward way.
The "increment step" of the for loop consists of only one assignment, that increments the appropriate counter (i.e. the element of counter of the proper nesting depth) and assigns this value to our "counter variable" _.
What about this??
void DostuffFunction(){}
for (unsigned i = 0; i < 2; ++i, DostuffFunction());
Regards,
Pablo.
What abelenky said.
And if your { // Do stuff } is multi-line, make it a function, and call that function -- twice.
Many people suggest writing out the code twice, which is fine if the code is short. There is, however, a size of code block which would be awkward to copy but is not large enough to merit its own function (especially if that function would need an excessive number of parameters). My own normal idiom to run a loop 'n' times is
i = number_of_reps;
do
{
... whatever
} while(--i);
In some measure because I'm frequently coding for an embedded system where the up-counting loop is often inefficient enough to matter, and in some measure because it's easy to see the number of repetitions. Running things twice is a bit awkward because the most efficient coding on my target system
bit rep_flag;
rep_flag = 0;
do
{
...
} while(rep_flag ^= 1); /* Note: if loop runs to completion, leaves rep_flag clear */
doesn't read terribly well. Using a numeric counter suggests the number of reps can be varied arbitrarily, which in many instances won't be the case. Still, a numeric counter is probably the best bet.
As Edsger W. Dijkstra himself put it: "two or more, use a for". No need to be any simpler.
Another attempt:
for(i=2;i--;) /* Do stuff */
This solution has many benefits:
Shortest form possible, I claim (13 chars)
Still, readable
Includes initialization
The amount of repeats ("2") is visible in the code
Can be used as a toggle (1 or 0) inside the body e.g. for alternation
Works with single instruction, instruction body or function call
Flexible (doesn't have to be used only for "doing twice")
Dijkstra compliant ;-)
From comment:
for (i=2; i--; "Do stuff");
Use function:
func();
func();
Or use macro (not recommended):
#define DO_IT_TWICE(A) A; A
DO_IT_TWICE({ x+=cos(123); func(x); })
If your compiler supports this just put the declaration inside the for statement:
for (unsigned i = 0; i < 2; ++i)
{
// Do stuff
}
This is as elegant and efficient as it can be. Modern compilers can do loop unrolling and all that stuff, trust them. If you don't trust them, check the assembler.
And it has one little advantage to all other solutions, for everybody it just reads, "do it twice".
Assuming C++0x lambda support:
template <typename T> void twice(T t)
{
t();
t();
}
twice([](){ /*insert code here*/ });
Or:
twice([]()
{
/*insert code here*/
});
Which doesn't help you since you wanted it for C.
Good rule: three or more, do a for.
I think I read that in Code Complete, but I could be wrong. So in your case you don't need a for loop.
This is the shortest possible without preprocessor/template/duplication tricks:
for(int i=2; i--; ) /*do stuff*/;
Note that the decrement happens once right at the beginning, which is why this will loop precisely twice with the indices 1 and 0 as requested.
Alternatively you can write
for(int i=2; i--; /*do stuff*/) ;
But that's purely a difference of taste.
If what you are doing is somewhat complicated, wrap it in a function and call that function twice. (This depends on how many local variables your do-stuff code relies on.)
You could do something like
void do_stuff(int i){
// do stuff
}
do_stuff(0);
do_stuff(1);
But this may get extremely ugly if you are working on a whole bunch of local variables.
//dostuff
stuff;
//dostuff (Attention: I am doing the same stuff for the 2nd time)
stuff;
First, use a comment
/* Do the following stuff twice */
then,
1) use the for loop
2) write the statement twice, or
3) write a function and call the function twice
Do not use macros; as stated earlier, macros are evil.
(My answer's almost a triangle)
What is elegance? How do you measure it? Is someone paying you to be elegant? If so how do they determine the dollar-to-elegance conversion?
When I ask myself, "how should this be written," I consider the priorities of the person paying me. If I'm being paid to write fast code, control-c, control-v, done. If I'm being paid to write code fast, well.. same thing. If I'm being paid to write code that occupies the smallest amount of space on the screen, I short the stock of my employer.
A jump instruction is pretty slow, so if you write the lines one after the other, it would work faster than writing a loop. But modern compilers are very, very smart and the optimizations are great (if they are allowed, of course). If you have turned on your compiler's optimizations, you don't care which way you write it - with a loop or not (:
EDIT: http://en.wikipedia.org/wiki/compiler_optimizations - just take a look (:
Close to your example, elegant and efficient:
for (i = 2; i; --i)
{
/* Do stuff */
}
Here's why I'd recommend that approach:
It initializes the iterator to the number of iterations, which makes intuitive sense.
It uses decrement over increment so that the loop test expression is a comparison to zero (the "i;" can be interpreted as "is i true?" which in C means "is i non-zero"), which may optimize better on certain architectures.
It uses pre-decrement as opposed to post-decrement in the counting expression for the same reason (may optimize better).
It uses a for loop instead of do/while or goto or XOR or switch or macro or any other trick approach because readability and maintainability are more elegant and important than clever hacks.
It doesn't require you to duplicate the code for "Do stuff" so that you can avoid a loop. Duplicated code is an abomination and a maintenance nightmare.
If "Do stuff" is lengthy, move it into a function and give the compiler permission to inline it if beneficial. Then call the function from within the for loop.
I like Chris Case's solution (below), but the C language doesn't have default parameters.
My solution:
bool cy = false;
do {
// Do stuff twice
} while (cy = !cy);
If you want, you can do different things in the two passes by checking the boolean variable (maybe with the ternary operator).
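For example (a sketch):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    bool cy = false;
    do {
        printf("%s pass\n", cy ? "second" : "first");
    } while (cy = !cy);   /* true keeps the loop alive after pass one,
                             false ends it after pass two */
    return 0;
}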
void loopTwice (bool first = true)
{
// Recursion is your friend
if (first) {loopTwice(false);}
// Do Stuff
...
}
I'm sure there's a more elegant way, but this is simple to read and pretty simple to write. There might even be a way to eliminate the bool parameter, but this is what I came up with in 20 seconds.
I'm writing a loop in C, and I am just wondering on how to optimize it a bit. It's not crucial here as I'm just practicing, but for further knowledge, I'd like to know:
In a loop, for example the following snippet:
int i = 0;
while (i <= 10) {
printf("%d\n", i);
i++;
}
Does the processor check both (i < 10) and (i == 10) for every iteration? Or does it just check (i < 10) and, if it's true, continue?
If it checks both, wouldn't:
int i = 0;
while (i != 10) {
printf("%d\n", i);
i++;
}
be more efficient?
Thanks!
Both will be translated into a single assembly instruction. Most CPUs have comparison instructions for LESS THAN, for LESS THAN OR EQUAL, for EQUAL, and for NOT EQUAL.
One of the interesting things about these optimization questions is that they often show why you should code for clarity/correctness before worrying about the performance impact of these operations (which oh-so often don't have any difference).
Your 2 example loops do not have the same behavior:
int i = 0;
/* this will print 11 lines (0..10) */
while (i <= 10) {
printf("%d\n", i);
i++;
}
And,
int i = 0;
/* This will print 10 lines (0..9) */
while (i != 10) {
printf("%d\n", i);
i++;
}
To answer your question though, it's nearly certain that the performance of the two constructs would be identical (assuming that you fixed the problem so the loop counts were the same). For example, if your processor could only check for equality and whether one value were less than another in two separate steps (which would be a very unusual processor), then the compiler would likely transform the (i <= 10) to an (i < 11) test - or maybe an (i != 11) test.
This is a clear example of premature optimization... IMHO, that is something that programmers new to their craft are way too prone to worry about. If you must worry about it, learn to benchmark and profile your code so that your worries are based on evidence rather than supposition.
Speaking to your specific questions. First, a <= is not implemented as two operations testing for < and == separately in any C compiler I've met in my career. And that includes some monumentally stupid compilers. Notice that for integers, a <= 5 is the same condition as a < 6 and if the target architecture required that only < be used, that is what the code generator would do.
Your second concern, that while (i != 10) might be more efficient raises an interesting issue of defensive programming. First, no it isn't any more efficient in any reasonable target architecture. However, it raises a potential for a small bug to cause a larger failure. Consider this: if some line of code within the body of the loop modified i, say by making it greater than 10, what might happen? How long would it take for the loop to end, and would there be any other consequences of the error?
Finally, when wondering about this kind of thing, it often is worthwhile to find out what code the compiler you are using actually generates. Most compilers provide a mechanism to do this. For GCC, learn about the -S option which will cause it to produce the assembly code directly instead of producing an object file.
The operators <= and < are a single instruction in assembly, there should be no performance difference.
Note that tests for 0 can be a bit faster on some processors than to test for any other constant, therefore it can be reasonable to make a loop run backward:
int i = 10;
while (i != 0)
{
printf("%d\n", i);
i--;
}
Note that micro-optimizations like these usually gain you only very little extra performance; your time is better spent on using efficient algorithms.
Does the processor check both (i < 10) and (i == 10) for every iteration? Or does it just check (i < 10) and, if it's true, continue?
Neither, it will most likely check (i < 11). The <= 10 is just there for you to give better meaning to your code since 11 is a magic number which actually means (10+1).
Depends on the architecture and compiler. On most architectures, there is a single instruction for <= or the opposite, which can be negated, so if it is translated into a loop, the comparison will most likely be only one instruction. (On x86 or x86_64 it is one instruction)
The compiler might unroll the loop into a sequence of ten i++ operations; when only constant expressions are involved, it will even optimize the ++ away and leave only constants.
And Ira is right, the comparison does vanish if there is a printf involved, whose execution time might be millions of clock cycles.
I'm writing a loop in C, and I am just wondering on how to optimize it a bit.
If you compile with optimizations turned on, the biggest optimization will be from unrolling that loop.
It's going to be hard to profile that code with -O2, because for trivial functions the compiler will unroll the loop and you won't be able to benchmark actual differences in compares. You should be careful when profiling test cases that use constants that might make the code trivial when optimized by the compiler.
Disassemble. Depending on the processor, the optimization settings, and a number of other things, this simple example code actually unrolls or does things that do not reflect your real question. Compiling both example loops you provided with gcc -O1, though, resulted in the same assembler (for ARM).
A less-than in your C code often turns into a branch-if-greater-than-or-equal to the far side of the loop. If your processor doesn't have a greater-than-or-equal, it may use a branch-if-greater-than and a branch-if-equal, two instructions.
Typically, though, there will be a register holding i, an instruction to increment i, then an instruction to compare i with 10. Equal-to, greater-than-or-equal, and less-than are generally done in a single instruction, so you should not normally see a difference.
// Case I
int i = 0;
while (i < 10) {
printf("%d\n", i);
i++;
printf("%d\n", i);
i++;
}
// Case II
int i = 0;
while (i < 10) {
printf("%d\n", i);
i++;
}
The Case I code takes more space but is faster; the Case II code takes less space but is slower compared to Case I.
This is the classic space/time trade-off in programming: you often have to compromise on either space or time.
So you can optimize for time or for space, but not always for both.
And both pieces of code do the same thing.
Our computer science teacher once said that for some reason it is faster to count down than to count up.
For example, if you need to use a FOR loop and the loop index is not otherwise used (like printing a line of N * to the screen)
I mean that code like this:
for (i = N; i >= 0; i--)
putchar('*');
is faster than:
for (i = 0; i < N; i++)
putchar('*');
Is it really true? And if so, does anyone know why?
Is it really true? and if so does anyone know why?
In ancient days, when computers were still chipped out of fused silica by hand, when 8-bit microcontrollers roamed the Earth, and when your teacher was young (or your teacher's teacher was young), there was a common machine instruction called decrement and skip if zero (DSZ). Hotshot assembly programmers used this instruction to implement loops. Later machines got fancier instructions, but there were still quite a few processors on which it was cheaper to compare something with zero than to compare with anything else. (It's true even on some modern RISC machines, like PPC or SPARC, which reserve a whole register to be always zero.)
So, if you rig your loops to compare with zero instead of N, what might happen?
You might save a register
You might get a compare instruction with a smaller binary encoding
If a previous instruction happens to set a flag (likely only on x86 family machines), you might not even need an explicit compare instruction
Are these differences likely to result in any measurable improvement on real programs on a modern out-of-order processor? Highly unlikely. In fact, I'd be impressed if you could show a measurable improvement even on a microbenchmark.
Summary: I smack your teacher upside the head! You shouldn't be learning obsolete pseudo-facts about how to organize loops. You should be learning that the most important thing about loops is to be sure that they terminate, produce correct answers, and are easy to read. I wish your teacher would focus on the important stuff and not mythology.
Here's what might happen on some hardware depending on what the compiler can deduce about the range of the numbers you're using: with the incrementing loop you have to test i<N each time round the loop. For the decrementing version, the carry flag (set as a side effect of the subtraction) may automatically tell you if i>=0. That saves a test per time round the loop.
In reality, on modern pipelined processor hardware, this stuff is almost certainly irrelevant as there isn't a simple 1-1 mapping from instructions to clock cycles. (Though I could imagine it coming up if you were doing things like generating precisely timed video signals from a microcontroller. But then you'd write in assembly language anyway.)
In the Intel x86 instruction set, building a loop to count down to zero can usually be done with fewer instructions than a loop that counts up to a non-zero exit condition. Specifically, the ECX register is traditionally used as a loop counter in x86 asm, and the Intel instruction set has a special jcxz jump instruction that tests the ECX register for zero and jumps based on the result of the test.
However, the performance difference will be negligible unless your loop is already very sensitive to clock cycle counts. Counting down to zero might shave 4 or 5 clock cycles off each iteration of the loop compared to counting up, so it's really more of a novelty than a useful technique.
Also, a good optimizing compiler these days should be able to convert your count up loop source code into count down to zero machine code (depending on how you use the loop index variable) so there really isn't any reason to write your loops in strange ways just to squeeze a cycle or two here and there.
Yes..!!
Counting from N down to 0 is slightly faster than counting from 0 to N, in the sense of how the hardware handles the comparison.
Note the comparison in each loop
i>=0
i<N
Most processors have a compare-with-zero instruction, so the first one will be translated to machine code as:
Load i
Compare and jump if Less than or Equal zero
But the second one needs to load N from memory each time:
load i
load N
Sub i and N
Compare and jump if Less than or Equal zero
So it is not because of counting down or up, but because of how your code will be translated into machine code.
So counting from 10 to 100 is the same as counting from 100 to 10.
But counting from i=100 to 0 is faster than from i=0 to 100 - in most cases.
And counting from i=N to 0 is faster than from i=0 to N.
Note that nowadays compilers may do this optimization for you (if they are smart enough).
Note also that the pipeline can cause a Belady's-anomaly-like effect (one cannot be sure which will be better).
At last: please note that the two for loops you have presented are not equivalent - the first prints one more * ...
Related:
Why does n++ execute faster than n=n+1?
In C to pseudo-assembly:
for (i = 0; i < 10; i++) {
foo(i);
}
turns into
clear i
top_of_loop:
call foo
increment i
compare 10, i
jump_less top_of_loop
whereas:
for (i = 10; i >= 0; i--) {
foo(i);
}
turns into
load i, 10
top_of_loop:
call foo
decrement i
jump_not_neg top_of_loop
Note the lack of the compare in the second pseudo-assembly. On many architectures there are flags that are set by arithmetic operations (add, subtract, multiply, divide, increment, decrement) which you can use for jumps. These often give you what is essentially a comparison of the result of the operation with 0 for free. In fact, on many architectures
x = x - 0
is semantically the same as
compare x, 0
Also, the compare against 10 in my example could result in worse code. 10 may have to live in a register, so if registers are in short supply that costs, and it may result in extra code to move things around or to reload the 10 every time through the loop.
Compilers can sometimes rearrange the code to take advantage of this, but it is often difficult because they are often unable to be sure that reversing the direction through the loop is semantically equivalent.
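For example (an illustration of mine): the first loop below is order-independent, so the compiler may legally run it downwards instead; the second is order-dependent, so reversing it would change the result.

/* order-independent: the additions can be performed in any order */
long sum(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* order-dependent: iteration i reads the value iteration i-1 just wrote */
void prefix_sums(int *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] += a[i - 1];
}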
Counting down is faster in a case like this:
for (i = someObject.getAllObjects.size(); i >= 0; i--) {…}
because someObject.getAllObjects.size() executes once at the beginning.
Sure, similar behaviour can be achieved by calling size() outside of the loop, as Peter mentioned:
size = someObject.getAllObjects.size();
for (i = 0; i < size; i++) {…}
What matters much more than whether you're increasing or decreasing your counter is whether you're going up memory or down memory. Most caches are optimized for going up memory, not down memory. Since memory access time is the bottleneck that most programs today face, this means that changing your program so that you go up memory might result in a performance boost even if this requires comparing your counter to a non-zero value. In some of my programs, I saw a significant improvement in performance by changing my code to go up memory instead of down it.
Skeptical? Just write a program to time loops going up/down memory. Here's the output that I got:
Average Up Memory = 4839 mus
Average Down Memory = 5552 mus
Average Up Memory = 18638 mus
Average Down Memory = 19053 mus
(where "mus" stands for microseconds) from running this program:
#include <chrono>
#include <iostream>
#include <limits>
#include <random>
#include <vector>

using namespace std;

//Sum all numbers going up memory.
template<class Iterator, class T>
inline void sum_abs_up(Iterator first, Iterator one_past_last, T &total) {
    T sum = 0;
    auto it = first;
    do {
        sum += *it;
        it++;
    } while (it != one_past_last);
    total += sum;
}

//Sum all numbers going down memory.
template<class Iterator, class T>
inline void sum_abs_down(Iterator first, Iterator one_past_last, T &total) {
    T sum = 0;
    auto it = one_past_last;
    do {
        it--;
        sum += *it;
    } while (it != first);
    total += sum;
}

//Time how long it takes to make num_repititions identical calls to sum_abs_down().
//We will divide this time by num_repititions to get the average time.
template<class T>
chrono::nanoseconds TimeDown(vector<T> &vec, const vector<T> &vec_original,
                             size_t num_repititions, T &running_sum) {
    chrono::nanoseconds total{0};
    for (size_t i = 0; i < num_repititions; i++) {
        auto start_time = chrono::high_resolution_clock::now();
        sum_abs_down(vec.begin(), vec.end(), running_sum);
        total += chrono::high_resolution_clock::now() - start_time;
        vec = vec_original;
    }
    return total;
}

template<class T>
chrono::nanoseconds TimeUp(vector<T> &vec, const vector<T> &vec_original,
                           size_t num_repititions, T &running_sum) {
    chrono::nanoseconds total{0};
    for (size_t i = 0; i < num_repititions; i++) {
        auto start_time = chrono::high_resolution_clock::now();
        sum_abs_up(vec.begin(), vec.end(), running_sum);
        total += chrono::high_resolution_clock::now() - start_time;
        vec = vec_original;
    }
    return total;
}

template<class Iterator, typename T>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, T a, T b) {
    random_device rnd_device;
    mt19937 generator(rnd_device());
    uniform_int_distribution<T> dist(a, b);
    for (auto it = start; it != one_past_end; it++)
        *it = dist(generator);
    return;
}

template<class Iterator>
void FillWithRandomNumbers(Iterator start, Iterator one_past_end, double a, double b) {
    random_device rnd_device;
    mt19937_64 generator(rnd_device());
    uniform_real_distribution<double> dist(a, b);
    for (auto it = start; it != one_past_end; it++)
        *it = dist(generator);
    return;
}

template<class ValueType>
void TimeFunctions(size_t num_repititions, size_t vec_size = (1u << 24)) {
    auto lower = numeric_limits<ValueType>::min();
    auto upper = numeric_limits<ValueType>::max();
    vector<ValueType> vec(vec_size);
    FillWithRandomNumbers(vec.begin(), vec.end(), lower, upper);
    const auto vec_original = vec;
    ValueType sum_up = 0, sum_down = 0;
    auto time_up = TimeUp(vec, vec_original, num_repititions, sum_up).count();
    auto time_down = TimeDown(vec, vec_original, num_repititions, sum_down).count();
    cout << "Average Up Memory = " << time_up/(num_repititions * 1000) << " mus\n";
    cout << "Average Down Memory = " << time_down/(num_repititions * 1000) << " mus"
         << endl;
    return;
}

int main() {
    size_t num_repititions = 1 << 10;
    TimeFunctions<int>(num_repititions);
    cout << '\n';
    TimeFunctions<double>(num_repititions);
    return 0;
}
Both sum_abs_up and sum_abs_down do the same thing (sum the vector of numbers) and are timed the same way with the only difference being that sum_abs_up goes up memory while sum_abs_down goes down memory. I even pass vec by reference so that both functions access the same memory locations. Nevertheless, sum_abs_up is consistently faster than sum_abs_down. Give it a run yourself (I compiled it with g++ -O3).
It's important to note how tight the loop that I'm timing is. If a loop's body is large (has a lot of code) then it likely won't matter whether its iterator goes up or down memory since the time it takes to execute the loop's body will likely completely dominate. Also, it's important to mention that with some rare loops, going down memory is sometimes faster than going up it. But even with such loops it was never the case that going up memory was always slower than going down (unlike small-bodied loops that go up memory, for which the opposite is frequently true; in fact, for a small handful of loops I've timed, the increase in performance by going up memory was 40+%).
The point is, as a rule of thumb, if you have the option, if the loop's body is small, and if there's little difference between having your loop go up memory instead of down it, then you should go up memory.
FYI vec_original is there for experimentation, to make it easy to change sum_abs_up and sum_abs_down in a way that makes them alter vec while not allowing these changes to affect future timings. I highly recommend playing around with sum_abs_up and sum_abs_down and timing the results.
On some older CPUs there are/were instructions like DJNZ == "decrement and jump if not zero". This allowed for efficient loops where you loaded an initial count value into a register and then you could effectively manage a decrementing loop with one instruction. We're talking 1980s ISAs here though - your teacher is seriously out of touch if he thinks this "rule of thumb" still applies with modern CPUs.
Is it faster to count down than up?
Maybe. But far more than 99% of the time it won't matter, so you should use the most 'sensible' test for terminating the loop, and by sensible, I mean that it takes the least amount of thought by a reader to figure out what the loop is doing (including what makes it stop). Make your code match the mental (or documented) model of what the code is doing.
If the loop is working it's way up through an array (or list, or whatever), an incrementing counter will often match up better with how the reader might be thinking of what the loop is doing - code your loop this way.
But if you're working through a container that has N items, and are removing the items as you go, it might make more cognitive sense to work the counter down.
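A sketch of that removal case (the array layout and the remove_at helper here are hypothetical):

/* hypothetical helper: shifts the tail left by one */
void remove_at(int *items, int *count, int i)
{
    for (int j = i; j < *count - 1; j++)
        items[j] = items[j + 1];
    (*count)--;
}

void remove_negatives(int *items, int *count)
{
    /* counting down: a removal only moves elements we have already
       visited, so no index is skipped; counting up would skip the
       element that shifts into slot i */
    for (int i = *count - 1; i >= 0; i--)
        if (items[i] < 0)
            remove_at(items, count, i);
}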
A bit more detail on the 'maybe' in the answer:
It's true that on most architectures, testing for a calculation resulting in zero (or going from zero to negative) requires no explicit test instruction - the result can be checked directly. If you want to test whether a calculation results in some other number, the instruction stream will generally have to have an explicit instruction to test for that value. However, especially with modern CPUs, this test will usually add less than noise-level additional time to a looping construct. Particularly if that loop is performing I/O.
On the other hand, if you count down to zero, and use the counter as an array index, for example, you might find the code working against the memory architecture of the system - memory reads will often cause a cache to 'look ahead' several memory locations past the current one in anticipation of a sequential read. If you're working backwards through memory, the caching system might not anticipate reads of a memory location at a lower memory address. In this case, it's possible that looping 'backwards' might hurt performance. However, I'd still probably code the loop this way (as long as performance didn't become an issue) because correctness is paramount, and making the code match a model is a great way to help ensure correctness. Incorrect code is as unoptimized as you can get.
So I would tend to forget the professor's advice (of course, not on his test though - you should still be pragmatic as far as the classroom goes), unless and until the performance of the code really mattered.
Bob,
Not until you are doing microoptimizations, at which point you will have the manual for your CPU to hand. Further, if you were doing that sort of thing, you probably wouldn't be needing to ask this question anyway. :-) But, your teacher evidently doesn't subscribe to that idea....
There are 4 things to consider in your loop example:
for (i=N;
i>=0; //thing 1
i--) //thing 2
{
putchar('*'); //thing 3
}
Comparison
Comparison is (as others have indicated) relevant to particular processor architectures. There are more types of processors than those that run Windows. In particular, there might be an instruction that simplifies and speeds up comparisons with 0.
Adjustment
In some cases, it is faster to adjust up or down. Typically a good compiler will figure it out and redo the loop if it can. Not all compilers are good though.
Loop Body
You are (indirectly) making a system call with putchar. That is massively slow. Plus, you are rendering onto the screen (indirectly). That is even slower. Think a 1000:1 ratio or more. In this situation, the loop body totally and utterly outweighs the cost of the loop adjustment/comparison.
Caches
A cache and memory layout can have a large effect on performance. In this situation, it doesn't matter. However, if you were accessing an array and needed optimal performance, it would behoove you to investigate how your compiler and your processor laid out memory accesses and to tune your software to make the most of that. The stock example is the one given in relation to matrix multiplication.
It can be faster.
On the NIOS II processor I'm currently working with, the traditional for loop
for(i=0;i<100;i++)
produces the assembly:
ldw r2,-3340(fp) %load i to r2
addi r2,r2,1 %increase i by 1
stw r2,-3340(fp) %save value of i
ldw r2,-3340(fp) %load value again (???)
cmplti r2,r2,100 %compare if less than 100
bne r2,zero,0xa018 %jump
If we count down
for(i=100;i--;)
we get assembly that needs two instructions fewer.
ldw r2,-3340(fp)
addi r3,r2,-1
stw r3,-3340(fp)
bne r2,zero,0xa01c
If we have nested loops, where the inner loop is executed a lot, we can have a measurable difference:
int i, j, a = 0;
for (i = 100; i--; ) {
    for (j = 10000; j--; ) {
        a = j + 1;
    }
}
If the inner loop is written like above, the execution time is: 0.12199999999999999734 seconds.
If the inner loop is written the traditional way, the execution time is: 0.17199999999999998623 seconds. So the loop counting down is about 30% faster.
But: this test was made with all GCC optimizations turned off. If we turn them on, the compiler is actually smarter than this hand optimization and even keeps the value in a register during the whole loop; we would get assembly like
addi r2,r2,-1
bne r2,zero,0xa01c
In this particular example the compiler even notices that variable a will always be 1 after the loop execution, and skips the loop altogether.
However, I have experienced that sometimes, if the loop body is complex enough, the compiler is not able to do this optimization, so the safest way to always get a fast loop execution is to write:
register int i;
for(i=10000;i--;)
{ ... }
Of course this only works if it does not matter that the loop is executed in reverse and, like Betamoo said, only if you are counting down to zero.
Regardless of the direction, always use the prefix form (++i instead of i++)!
for (i=N; i>=0; --i)
or
for (i=0; i<N; ++i)
Explanation: http://www.eskimo.com/~scs/cclass/notes/sx7b.html
Furthermore you can write
for (i=N; i; --i)
But I would expect modern compilers to be able to do exactly these optimizations.
It is an interesting question, but as a practical matter I don't think it's important and does not make one loop any better than the other.
According to this wikipedia page: Leap second, "...the solar day becomes 1.7 ms longer every century due mainly to tidal friction." But if you are counting days until your birthday, do you really care about this tiny difference in time?
It's more important that the source code is easy to read and understand. Those two loops are a good example of why readability is important -- they don't loop the same number of times.
I would bet that most programmers read (i = 0; i < N; i++) and understand immediately that this loops N times. A loop of (i = 1; i <= N; i++), for me anyway, is a little less clear, and with (i = N; i > 0; i--) I have to think about it for a moment. It's best if the intent of the code goes directly into the brain without any thinking required.
Strangely, it appears that there IS a difference. At least in PHP. Consider the following benchmark:
<?php
print "<br>".PHP_VERSION;
$iter = 100000000;
$i=$t1=$t2=0;
$t1 = microtime(true);
for($i=0;$i<$iter;$i++){}
$t2 = microtime(true);
print '<br>$i++ : '.($t2-$t1);
$t1 = microtime(true);
for($i=$iter;$i>0;$i--){}
$t2 = microtime(true);
print '<br>$i-- : '.($t2-$t1);
$t1 = microtime(true);
for($i=0;$i<$iter;++$i){}
$t2 = microtime(true);
print '<br>++$i : '.($t2-$t1);
$t1 = microtime(true);
for($i=$iter;$i>0;--$i){}
$t2 = microtime(true);
print '<br>--$i : '.($t2-$t1);
Results are interesting:
PHP 5.2.13
$i++ : 8.8842368125916
$i-- : 8.1797409057617
++$i : 8.0271911621094
--$i : 7.1027431488037
PHP 5.3.1
$i++ : 8.9625310897827
$i-- : 8.5790238380432
++$i : 5.9647901058197
--$i : 5.4021768569946
If someone knows why, it would be nice to know :)
EDIT: Results are the same even if you start counting not from 0 but from another arbitrary value. So it is probably not only the comparison with zero that makes the difference?
What your teacher said was an oblique statement without much clarification.
It is NOT that decrementing is faster than incrementing, but that you can create a much, much faster loop with decrement than with increment.
Without going on at length about it - what matters below is just speed and the (non-zero) loop count, with no need to use the loop counter for anything else.
Here is how most people implement a loop with 10 iterations:
int i;
for (i = 0; i < 10; i++)
{
//something here
}
For 99% of cases it is all one may need, but along with PHP, Python, and JavaScript there is a whole world of time-critical software (usually embedded, OS, games, etc.) where CPU ticks really matter, so look briefly at the assembly code of:
int i;
for (i = 0; i < 10; i++)
{
//something here
}
after compilation (without optimisation) the compiled version may look like this (VS2015):
-------- C7 45 B0 00 00 00 00 mov dword ptr [i],0
-------- EB 09 jmp labelB
labelA 8B 45 B0 mov eax,dword ptr [i]
-------- 83 C0 01 add eax,1
-------- 89 45 B0 mov dword ptr [i],eax
labelB 83 7D B0 0A cmp dword ptr [i],0Ah
-------- 7D 02 jge out1
-------- EB EF jmp labelA
out1:
The whole loop is 8 instructions (26 bytes). In it there are actually 6 instructions (17 bytes) with 2 branches. Yes, yes, I know it can be done better (it's just an example).
Now consider this frequent construct, which you will often find written by embedded developers:
i = 10;
do
{
//something here
} while (--i);
It also iterates 10 times (yes, I know the value of i is different compared with the for loop shown, but we care about the iteration count here).
This may be compiled into this:
00074EBC C7 45 B0 01 00 00 00 mov dword ptr [i],1
00074EC3 8B 45 B0 mov eax,dword ptr [i]
00074EC6 83 E8 01 sub eax,1
00074EC9 89 45 B0 mov dword ptr [i],eax
00074ECC 75 F5 jne main+0C3h (074EC3h)
5 instructions (18 bytes) and just one branch. Actually there are 4 instructions in the loop (11 bytes).
The best thing is that some CPUs (x86/x64-compatible ones included) have an instruction that decrements a register, then compares the result with zero, and branches if it is non-zero. Virtually ALL PC CPUs implement this instruction. Using it, the loop is actually just one (yes, one) 2-byte instruction:
00144ECE B9 0A 00 00 00 mov ecx,0Ah
label:
// something here
00144ED3 E2 FE loop label (0144ED3h) // decrement ecx and jump to label if not zero
Do I have to explain which is faster?
Now, even if a particular CPU does not implement the above instruction, all that is required to emulate it is a decrement followed by a conditional jump that is taken if the result of the previous instruction is not zero.
So regardless of some cases that you may point out in a comment about why I am wrong, etc. etc., I EMPHASISE - YES, IT IS BENEFICIAL TO LOOP DOWNWARDS if you know how, why and when.
PS. Yes, I know that a wise compiler (with the appropriate optimisation level) will rewrite a for loop (with an ascending loop counter) into a do..while equivalent for constant loop iterations... (or unroll it)...
No, that's not really true. One situation where it could be faster is when you would otherwise be calling a function to check the bounds during every iteration of a loop.
for(int i=myCollection.size(); i >= 0; i--)
{
...
}
But if it's less clear to do it that way, it's not worthwhile. In modern languages, you should use a foreach loop when possible, anyway. You specifically mention the case where you should use a foreach loop -- when you don't need the index.
The point is that when counting down you don't need to check i >= 0 separately from decrementing i. Observe:
for (i = 5; i--;) {
alert(i); // alert boxes showing 4, 3, 2, 1, 0
}
Both the comparison and the decrement of i can be done in a single expression.
See other answers for why this boils down to fewer x86 instructions.
As to whether it makes a meaningful difference in your application, well I guess that depends on how many loops you have and how deeply nested they are. But to me, it's just as readable to do it this way, so I do it anyway.
Now, I think you have had enough assembly lectures :) I would like to present another reason for the top->down approach.
The reason to go from the top is very simple: in the body of the loop, you might accidentally change the boundary, which might end in incorrect behaviour or even a non-terminating loop.
Look at this small portion of Java code (the language does not matter, I guess, for this reason):
System.out.println("top->down");
int n = 999;
for (int i = n; i >= 0; i--) {
n++;
System.out.println("i = " + i + "\t n = " + n);
}
System.out.println("bottom->up");
n = 1;
for (int i = 0; i < n; i++) {
n++;
System.out.println("i = " + i + "\t n = " + n);
}
So my point is: you should consider preferring to go from the top down, or having a constant as a boundary.
At an assembler level a loop that counts down to zero is generally slightly faster than one that counts up to a given value. If the result of a calculation is equal to zero most processors will set a zero flag. If subtracting one makes a calculation wrap around past zero this will normally change the carry flag (on some processors it will set it on others it will clear it), so the comparison with zero comes essentially for free.
This is even more true when the number of iterations is not a constant but a variable.
In trivial cases the compiler may be able to optimise the count direction of a loop automatically but in more complex cases it may be that the programmer knows that the direction of the loop is irrelevant to the overall behaviour but the compiler cannot prove that.
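A sketch of the down-counting shape this suggests when the iteration count n is a run-time variable:

void repeat(unsigned n)
{
    if (n == 0)        /* a do/while tests after the body, so guard 0 */
        return;
    do {
        /* loop body */
    } while (--n);     /* the decrement itself sets the zero flag;
                          no separate compare instruction is needed */
}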