Can I use Duff's Device on an array in C? - c

I have a loop here and I want to make it run faster. I am passing in a large array. I recently heard of Duff's Device. Can it be applied to this for loop? Any ideas?
for (i = 0; i < dim; i++) {
    for (j = 0; j < dim; j++) {
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
}

Please, please don't use Duff's device. A thousand maintenance programmers will thank you. I used to work for a training company where someone thought it funny to introduce the device in the first ten pages of their C programming course. As an instructor it was impossible to deal with, unless (as the guy that wrote that bit of the course apparently did) you believe in "kewl" coding.
Needless to say, I had the thing expunged from the course, ASAP.

Why do you want to make it run faster?
Is there an actual performance problem?
If so, have you profiled and found that this is executing often enough, and hence worth optimizing?
If so, you may want to write it in two ways, the straightforward way you have now and with Duff's Device, or any other method you like.
At that point, you test the performance. You may be surprised. Modern optimizers are quite good, and modern CPUs are really complicated, so source-level optimization is often counterproductive. (I once did this in a loop that was taking a whole lot of time, and found that tightening up the loop, even while introducing some indirection, improved performance. Your mileage is almost certainly going to vary.)
Finally, if Duff's Device is indeed faster, you have to decide whether the performance improvement is worth taking this straightforward and optimizable code and substituting a maintenance problem that may not improve performance at all in the next compiler version.

You should never unroll loops by hand. It would only give you a very platform-specific advantage, if any. All good compilers can unroll loops, but it's not even guaranteed to make the code faster, because it takes up more memory bandwidth to read a longer program from main memory.
If you want the loop to run fast, you should make sure that whatever RIDX computes, dst is accessed sequentially, so you minimize the number of cache misses. Other than that I can't see how you could make the loop faster.
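As a rough sketch of what "dst is accessed sequentially" could look like: the question never shows RIDX or the element type, so the row-major macro and the int elements below are assumptions, and rotate_naive / rotate_dst_sequential are illustrative names. The second function reorders the loops so that the inner loop writes one row of dst from left to right.

```c
#include <stddef.h>

/* ASSUMPTION: the question never shows RIDX; a row-major index is a guess. */
#define RIDX(i, j, n) ((i) * (n) + (j))

/* The loop from the question, unchanged apart from the assumed int type. */
void rotate_naive(size_t dim, const int *src, int *dst)
{
    for (size_t i = 0; i < dim; i++)
        for (size_t j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Same result, but the inner loop writes one row of dst left to right. */
void rotate_dst_sequential(size_t dim, const int *src, int *dst)
{
    for (size_t r = 0; r < dim; r++) {      /* r is the dst row: r = dim-1-j */
        size_t j = dim - 1 - r;             /* matching src column */
        for (size_t c = 0; c < dim; c++)    /* c is the dst column: c = i */
            dst[RIDX(r, c, dim)] = src[RIDX(c, j, dim)];
    }
}
```

The reads from src now stride by dim instead, so whether this is actually a win depends on the cache and has to be measured.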

Duff's Device is simply a technique for loop unrolling. And since any loop can be unrolled, you can use Duff's Device.

Even if you were able to figure this out and get a gain, it would be a pittance and would in no way justify the complexity.
You would be better served spending your energies a level up--reconsidering your entire solution. Perhaps rather than copying values you could create a translation array and spend a little more time looking up answers indirectly when you need them (not really a good idea for building images--just trying to give you a different way to look at it).
Or maybe there is some completely different approach--look at your entire problem and try completely throwing away your current approaches and concepts and just see if there is something you haven't considered because you are too tied to this implementation.
Could your graphics card do some of this work?
Rethinking the problem at a high level works a lot more often than you might think.
Edit:
Looking at your sample more, it looks like you are taking a block of your image and copying it, pixel for pixel, to another image. If so, there are almost certainly ways to do it getting rid of the macro and copying byte for byte instead, or even using a block move assembly function then tweaking the edges of the result to match.
Or I may have guessed wrong, but chances are that looking at it on a larger scale than pixel for pixel might help you a lot more than unrolling loops.

The number of instruction cycles to implement the statement
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
will far outweigh the loop overhead, so unrolling the loop will be very little help on a percentage basis.

I believe this is a candidate for Duff's Device, depending on what the RIDX() function does. But I hope you don't expect someone to write the code for you... Also, you might want to format your code properly so it's actually readable.

Probably, as long as dim is a power of 2 or you have fast modulus on your target system. Learned something new today. I independently discovered that construct 5 years back and dropped it into our memCopy() routine. Who knew :)

Pedantically, no. Duff's Device was for writing to a hardware register (thus the target of the copy was always the same address).
You can implement something very much like Duff's Device for a copy like this, but there will be a clarity and maintenance cost. I'd first profile to make sure it's a problem. I'd also look into whether you can simplify the indexing, as that may enable the compiler to do the dirty work of unrolling the loop.
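For reference, a memory-to-memory variant might look roughly like this. It is only a sketch, not the original device (which, as noted, kept writing to one address); duff_copy is an illustrative name and int elements are assumed.

```c
#include <stddef.h>

/* Copies count ints from src to dst, unrolled eight times.
   The switch jumps into the middle of the do-while to handle count % 8. */
void duff_copy(int *dst, const int *src, size_t count)
{
    if (count == 0)
        return;                       /* the loop below needs count > 0 */

    size_t passes = (count + 7) / 8;  /* trips through the do-while */
    switch (count % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

Even this simple form shows the clarity cost; a plain loop or memcpy says the same thing and leaves the unrolling to the compiler.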

If you use it, make sure you measure it to determine that the improvement is real, significant, and necessary in terms of your performance requirements. I doubt it will be.
For large loops, the remainder dealt with by Duff's device will be an insignificant proportion of the operation, and for small loops where the remainder is significant you will only see a benefit if you have many such loops (themselves in a loop), because small loops by definition don't take that long! Even then the compiler's optimiser is likely to do as well or better without rendering your code unreadable. It is also possible that the application of Duff's device will prevent the optimiser from applying other, perhaps more effective, optimisations, which is why if you use it you need to measure it.
All the time you are likely to save on this (if any) you have probably wasted several times over reading responses to this question.

Duff's device may not be the optimized solution in an unrolled loop.
I had a function that sent a bit to a port, followed by a clock pulse to another port. For each bit, the functions were:
if (bit == 1)
{
    write to the set port.
}
else
{
    write to the clear port.
}
write high clock bit.
write low clock bit.
This was put into a Duff's device loop, along with bit shifting and bit count incrementing.
I improved the efficiency of the loop by using nibble values instead of bits (a nibble being 4 bits). The switch statement was based on the nibble value. This allowed 4 bits to be processed without any if statements, improving the flow through instruction cache (pipeline).
There are times when Duff's device may not be the optimal solution; but can be the foundation for a more efficient solution.

Modern compilers already do loop unrolling for you when optimizations are turned on, which renders Duff's device obsolete. The compiler knows better than you do the optimal level of unrolling for your compilation target, and you don't have to write any extra code to do it. It was a neat hack at the time, but these days Duff's device is just a historical curiosity, not a good programming practice.

In the end, whoever makes the call on optimization, everyone involved needs to be sure it is well documented and written in a style that is as self-documenting as possible, using correctly spelled, meaningful names for variables, functions, etc., so that it is obvious if the comments and the code get out of sync.
The need for optimization will never end. I was talking with a grad student who had broken malloc()/free() while working on the largest file of genetic data ever attempted in one pass. After a while the heap became too fragmented for malloc to find a block of contiguous RAM to allocate to the calling function. He had to switch to a library whose malloc only issued blocks of memory on 32k boundaries. It took 160% more memory than the old library and ran a good deal slower, but it finished the job.
You must be careful using Duff's Device and many other optimizations to be sure the compiler doesn't optimize your optimization into obscure, broken object code. As we enter an environment using automatic parallelizing tools this will become more of a problem.
I expect that the lower the level of the optimization, the more likely future optimizations are to break the code. I can see that my habit of discarding line feeds in code designed to run on multiple platforms, and putting the line feeds back in the print and write functions on each platform, will run into problems in several of the things discussed in this thread.
-gcouger

Related

C- Why is for loop pointer indexing faster? [duplicate]

Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.
First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way "Before you make it fast, make it work"
Investigate whether there's any option for optimising the algorithm before optimising the code. It'll be easier to find an improvement in performance by optimising a poor algorithm than it is to optimise the code, only then to throw it away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event, work out if there is an opportunity to change the order of execution to minimise the time-critical areas. For example, when trying to improve the response to some external interrupt, can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
Minimise condition checking. Checking conditions (eg. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop-unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
Minimise function calls. Function calls usually carry a burden of context saving (eg. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform if the only operation that you're doing on a 32 bit value is an increment you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. This is wasted cpu cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)
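To make the CRC trade-off from the precalculation point concrete, here is a minimal sketch of both variants for the common reflected CRC-32 (polynomial 0xEDB88320). The function names are illustrative, and the table here is built during initialisation rather than at compile time.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0xEDB88320u   /* reflected CRC-32 polynomial */

/* Small but slow: eight shift/xor steps per input byte. */
uint32_t crc32_bitwise(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ CRC32_POLY : crc >> 1;
    }
    return crc ^ 0xFFFFFFFFu;
}

static uint32_t crc32_table[256];

/* Run once at start-up; costs 1 KiB of RAM for the table. */
void crc32_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int bit = 0; bit < 8; bit++)
            c = (c & 1u) ? (c >> 1) ^ CRC32_POLY : c >> 1;
        crc32_table[n] = c;
    }
}

/* Faster: one table lookup per input byte. */
uint32_t crc32_table_driven(const unsigned char *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32_table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}
```

Both functions return the same value for the same input; which one is appropriate depends on whether the 1 KiB table or the per-byte inner loop hurts more on the target.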
As everybody else has said: profile, profile, profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
    int ID;
    int AccountNumber;
    char Name[128];
    char Address[256];
};
struct Customer customers[1000];
Now, let's assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is to split it into two:
struct CustomerAccount
{
    int ID;
    int AccountNumber;
    struct CustomerData *pData;
};
struct CustomerData
{
    char Name[128];
    char Address[256];
};
struct CustomerAccount customers[1000];
In this way, when you're looping through your "customers" array, each entry is only 12 bytes (16 on platforms with 8-byte pointers), so you can fit many more entries in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.
My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.
The most common techniques I have encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them have already been mentioned:
choose better algos
use profiler
don't optimize if it doesn't give 20-30% performance boost
For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Actually inlined assembly for tasks which are far more suited to the x86 SIMD unit, such as median calculations (requires compile-time check for MMX support).
First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression
Avoid using the heap. Use obstacks or pool-allocator for identical sized objects. Put small things with short lifetime onto the stack. alloca still exists.
Premature optimization is the root of all evil!
;)
As my applications usually don't need much CPU time by design, I focus on the size of my binaries on disk and in memory. What I mostly do is look out for statically sized arrays and replace them with dynamically allocated memory where it's worth the additional effort of freeing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and move the initialization to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
This will remove the 1024 zero bytes from the binary's .DATA section and will instead create the buffer on the stack at runtime and then fill it with zeros.
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)
If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }
Another thing that was not mentioned:
Know your requirements: don't optimize for situations that are unlikely to happen or will never happen; concentrate on the most bang for the buck.
basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often used globals
count to-zero, zero flag is free
Difficult to summarize ...
Data structures:
Splitting of a data structure depending on case of usage is extremely important. It is common to see a structure that holds data that is accessed based on a flow control. This situation can lower significantly the cache usage.
To take into account cache line size and prefetch rules.
To reorder the members of the structure to obtain a sequential access to them from your code
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
As for the latest processors, it is not recommended to unroll a loop that has a small body. The processor provides its own detection mechanism for this and will short-circuit a whole section of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce the local working set of your application as much as possible, as processors are leaning toward smaller caches per core (C2D enjoyed a 3MB per core max, whereas Core i7 will provide a max of 256KB per core + 8MB shared by all cores on a quad-core die).
The most important of all: measure early, measure often, and never ever make assumptions. Base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint, performance is key to the success of an application and should be considered at design time and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.
These days, the most important things in optimization are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictability - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There are a lot of branchless math tricks that are useful here, but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.
Collecting profiles of code execution gets you 50% of the way there. The other 50% deals with analyzing these reports.
Further, if you use GCC or Visual C++, you can use "profile guided optimization" where the compiler will take info from previous executions and reschedule instructions to make the CPU happier.
Inline functions! Inspired by the profiling fans here I profiled an application of mine and found a small function that does some bitshifting on MP3 frames. It makes about 90% of all function calls in my application, so I made it inline and voila - the program now uses half of the CPU time it did before.
On most of the embedded systems I worked on there were no profiling tools, so it's nice to say "use a profiler" but not very practical.
The first rule in speed optimization is: find your critical path.
Usually you will find that this path is not so long and not so complex. It's hard to say in a generic way how to optimize it; it depends on what you are doing and what is in your power to do. For example, you usually want to avoid memcpy on the critical path, so you either need to use DMA or optimize the copy. But what if your hardware does not have DMA? Check whether the memcpy implementation is the best one available, and if not, rewrite it.
Do not use dynamic allocation at all in embedded code, but if you do for some reason, don't do it in the critical path.
Organize your thread priorities correctly. What counts as correct is the real question, and it is clearly system specific.
We use very simple tools to analyze bottlenecks: a simple macro that stores a time-stamp and an index. In 90% of cases a few (2-3) runs will show where you spend your time.
And the last one is code review, a very important one. In most cases we avoid performance problems during code review, which is a very effective way :)
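The time-stamp macro mentioned above is not shown in the answer, so the following is only a guess at its shape. TRACE_POINT, trace_log and trace_delta are made-up names, and the standard clock() stands in for whatever hardware timer a real embedded target would read.

```c
#include <stddef.h>
#include <time.h>

#define TRACE_CAPACITY 64

struct trace_entry {
    int     index;   /* which checkpoint was hit */
    clock_t stamp;   /* processor time when it was hit */
};

static struct trace_entry trace_log[TRACE_CAPACITY];
static size_t trace_count;

/* Record a checkpoint; silently drops entries once the buffer is full.
   ASSUMPTION: clock() is a stand-in for a platform-specific timer. */
#define TRACE_POINT(idx)                                   \
    do {                                                   \
        if (trace_count < TRACE_CAPACITY) {                \
            trace_log[trace_count].index = (idx);          \
            trace_log[trace_count].stamp = clock();        \
            trace_count++;                                 \
        }                                                  \
    } while (0)

/* Ticks elapsed between two recorded checkpoints. */
clock_t trace_delta(size_t from, size_t to)
{
    return trace_log[to].stamp - trace_log[from].stamp;
}
```

Dropping TRACE_POINT(1), TRACE_POINT(2), ... along the suspected critical path and dumping trace_log after a run is usually enough to see which stretch dominates.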
Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.
Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of your space, you pack structures (e.g. #pragma pack(1)) and use bit fields in structures. For more speed you pack to align with the processor's preference and avoid bitfields.
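A small illustration of the space side of that trade-off. #pragma pack is a compiler extension rather than standard C (GCC, Clang and MSVC accept it), the push/pop form is used here so the setting does not leak into later declarations, and the struct names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Natural layout: the compiler may insert padding so that value is aligned. */
struct record_aligned {
    uint8_t  tag;
    uint32_t value;
};

/* Packed layout: no padding, but value may sit on an unaligned address. */
#pragma pack(push, 1)
struct record_packed {
    uint8_t  tag;
    uint32_t value;
};
#pragma pack(pop)
```

With typical 4-byte alignment for uint32_t the first struct occupies 8 bytes and the packed one 5; the price is that every access to value in the packed struct may be unaligned, which is slow on some CPUs and not supported at all on others.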
Another trick is picking the right re-sizing algorithms for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.
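One common resizing policy is to double the capacity each time the array fills up, so that appending n items costs a number of realloc calls that grows only logarithmically with n. A minimal sketch, with int_vec as an illustrative type and doubling as one possible growth factor among several:

```c
#include <stddef.h>
#include <stdlib.h>

struct int_vec {
    int   *data;
    size_t len;   /* elements in use */
    size_t cap;   /* elements allocated */
};

/* Appends value, doubling the capacity when full.
   Returns 0 on success, -1 if realloc fails (the vector is left intact).
   Overflow of the size computation is not checked in this sketch. */
int int_vec_push(struct int_vec *v, int value)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap * 2 : 8;
        int *grown = realloc(v->data, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        v->data = grown;
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
    return 0;
}

void int_vec_free(struct int_vec *v)
{
    free(v->data);
    v->data = NULL;
    v->len = v->cap = 0;
}
```

A vector initialised with `struct int_vec v = {0};` works directly, since realloc on a null pointer behaves like malloc.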
If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.
Great lists. I will just add one tip I didn't see in the above lists that in some cases can yield a huge optimisation for minimal cost.
bypass linker
If you have an application divided into two files, say main.c and lib.c, in many cases you can just add #include "lib.c" to your main.c. That will completely bypass the linker and allow the compiler to optimise much more effectively.
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.
Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.
I would recommend optimizing using more efficient algorithms and not do it as an afterthought but code it that way from the start. Let the compiler work out the details on the small things as it knows more about the target processor than you do.
For one, I rarely use loops to look things up, I add items to a hashtable and then use the hashtable to lookup the results.
For example you have a string to lookup and then 50 possible values. So instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number ( you only have to do this once ). Then you lookup the target string in the hashtable and have one large switch with all 50 cases ( or have functions pointers ).
When looking up things with common sets of input (like CSS rules), I use fast code to keep track of the only possible solutions and then iterate through those to find a match. Once I have a match I save the results into a hashtable (as a cache) and then use the cached results if I get that same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )

Writing For loops efficiently

I am constructing the partial derivative of a function in C. The process consists mainly of a large number of small loops. Each loop is responsible for filling a column of the matrix. Because the size of the matrix is huge, the code should be written efficiently. I have a number of plans in mind for the implementation, which I do not want to get into in detail.
I know that smart compilers try to take advantage of the cache automatically. But I would like to know more about the details of using the cache and writing efficient code and efficient loops. I would appreciate it if you could point me to some resources or websites so I can learn more about writing efficient code in terms of reducing memory access time and taking advantage of the cache.
I know that my request may look sloppy, but I am not a computer guy. I did some research but with no success.
So, any help is appreciated.
Thanks
Well written code tends to be efficient (though not always optimal). Start by writing good clean code; then, if you actually have a performance problem, it can be isolated and addressed.
It is probably best that you write the code in the most readable and understandable way you can and then profile it to see where the bottlenecks really are. Often times your conception of where you need efficiency doesn't match up with reality.
Modern compilers do a decent job with many aspects of optimization and it seems unlikely that the process of looping will itself be a problem. Perhaps you should consider focusing on simplifying the calculation done by each loop.
Otherwise, you'll be looking at things such as accessing your matrix row by row so that you take advantage of the row-major storage order C uses (see this question).
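To illustrate the row-by-row point, here is a small sketch with an arbitrary 4x5 matrix (the sizes and function names are only for illustration). Both functions compute the same sum; only the order in which memory is touched differs.

```c
#include <stddef.h>

#define ROWS 4
#define COLS 5

/* Inner loop runs along a row, so consecutive accesses are adjacent in memory. */
double sum_by_rows(double m[ROWS][COLS])
{
    double total = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            total += m[i][j];
    return total;
}

/* Inner loop jumps COLS elements each step: same result, poorer locality. */
double sum_by_columns(double m[ROWS][COLS])
{
    double total = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            total += m[i][j];
    return total;
}
```

On a matrix this small the difference is invisible; it only starts to matter once a row no longer fits in a few cache lines. Since the question fills one column per loop, storing the matrix transposed (so each "column" is contiguous) may be worth trying.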
You'll want to build your for loops without if statements inside because if statements create what is called "branching". The computer essentially guesses which option will be right and pays a sometimes hefty penalty if it is wrong.
To extend that theme, you want to do as little inside the for loop as possible. You'll also want to define it with static limits, e.g.:
for(int i=1;i<100;i++) //This is better than
for(int i=1;i<N/i;i++) //this
Static limits mean that very little effort is expended determining if the for loop should keep going. They also permit you to use OpenMP to divvy up the work in the loops, which can sometimes speed things up considerably. This is simple to do:
#pragma omp parallel for
for(int i=0;i<100;i++)
And voilà! The code is parallelized.
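A complete function along those lines might look like the sketch below; scale_array and its parameters are made up for illustration. The pragma only takes effect when the compiler's OpenMP support is switched on (for example -fopenmp with GCC or Clang); otherwise it is ignored and the loop simply runs serially.

```c
/* Each iteration touches only out[i], so iterations can run in any order.
   The bound n does not change inside the loop, which OpenMP requires. */
void scale_array(const double *in, double *out, int n, double factor)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

This only works because the iterations are independent; a loop where iteration i reads what iteration i-1 wrote cannot be split up this way without further changes.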

Array access/write performance differences?

This is probably going to be language dependent, but in general, what is the performance difference between accessing and writing to an array?
For example, if I am trying to write a prime sieve and am representing the primes as a boolean array.
Upon finding a prime, I can say
for(int i = 2; n * i < end; i++)
{
    prime[n * i] = false;
}
or
for(int i = 2; n * i < end; i++)
{
    if(prime[n * i])
    {
        prime[n * i] = false;
    }
}
The intent in the latter case is to check the value before writing it to avoid having to rewrite many values that have already been checked. Is there any realistic gain in performance here, or are access and write mostly equivalent in speed?
Impossible to answer such a generic question without the specifics of the machine/OS this is running on, but in general the latter is going to be slower because:
In the second example you have to get the value from RAM into the L2/L1 cache and read it into a register, make a change to the value, and write it back. In the first case you might very well get away with simply writing a value to the L1/L2 caches. It can be written to RAM from the caches later while your program is doing something else.
The second form has much more code to execute per iteration. For large enough number of iterations, the difference gets big real fast.
In general this depends much more on the machine than the programming language. The writes often will take a few more clock cycles because, depending on the machine, more cache values need to be updated in memory.
However, your second segment of code will be WAY slower, and it's not just because there's "more code". The big reason is that anytime you use an if-statement on most machines the CPU uses a branch predictor. The CPU literally predicts which way the if-statement will run ahead of time, and if it's wrong it has to backtrack. See http://en.wikipedia.org/wiki/Pipeline_%28computing%29 and http://en.wikipedia.org/wiki/Branch_predictor to understand why.
If you want to do some optimization, I would recommend the following:
Profile! See what's really taking up time.
Multiplication is much harder than addition. Try rewriting the loop so that i += n, and use this for your array index.
The loop condition "should" be totally reevaluated at every iteration unless the compiler optimizes it away. So try avoiding multiplication in there.
Use -O2 or -O3 as a compiler option
You might find that some values of n are faster than others because of cache locality. You might think of some clever ways to rewrite your code to take advantage of this.
Disassemble the code and look at what it's actually doing on your processor
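To make the "i += n" suggestion from the list above concrete, here is a minimal sketch. The function name clear_multiples and its parameters are made up for illustration; only the idea of stepping the index by addition comes from the advice above. Whether it is actually faster than the multiplying version is something you would still have to measure, since an optimizing compiler may already perform this strength reduction for you.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: clear every multiple of n (starting at 2n) below `end`.
   The index advances by addition, so neither the loop condition nor the
   array subscript needs a multiplication per iteration. */
void clear_multiples(bool *prime, size_t n, size_t end)
{
    assert(n > 0);  /* a zero stride would never terminate */
    for (size_t m = 2 * n; m < end; m += n) {
        prime[m] = false;  /* unconditional write, as in the first snippet */
    }
}
```

You would call it as clear_multiples(prime, n, end) at the point where the original loop sits.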
It's a hard question and it heavily depends on your hardware, OS and compiler. But for the sake of theory, you should consider two things: branching and memory access. As branching is generally evil, you want to avoid it. I wouldn't even be surprised if some compiler optimization took place and your second snippet were reduced to the first one (compilers love avoiding branches, they probably consider it a hobby, but they have a reason). So in these terms the first example is much cleaner and easier to deal with.
There are also CPU caches and other memory-related issues. I believe that in both examples you have to actually load the memory into the CPU cache, so you can either read it or update it. While reading is not a problem, writing has to propagate the changes up. I wouldn't be worried if you use the function in a single thread (as @gby pointed out, the changes can be written back to RAM a little bit later).
There is only one scenario I can come up with that would make me consider the solution from your second example: if I shared the table between threads to work on it in parallel (without locking) and had separate caches for different CPUs. Then, every time you amend the cache line from one thread, the other thread has to update its copy before reading or writing to the same memory block. This is known as cache coherence and it may actually hurt your performance badly; in such a case I could consider conditional writes. But wait, it's probably far away from your question...

Which of the following would be more efficient?

In C:
Let's say the function "myfunction()" has 50 lines of code, in which other smaller functions also get called. Which one of the following would be more efficient?
void myfunction(long *a, long *b);
int i;
for(i=0;i<8;i++)
    myfunction(&a, &b);
or
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
any help would be appreciated.
That's premature optimization, you just shouldn't care...
Now, from a code maintenance point of view the first form (with the loop) is definitely better.
From a run-time point of view, if the function is inline and defined in the same compilation unit, the compiler does not unroll the loop itself, and the code is already in the instruction cache (I don't know about moon phases, though I still believe they shouldn't have any noticeable effect), the second one may be marginally faster.
As you can see, there are many conditions for it to be faster, so you shouldn't do that. There are probably many other parameters to optimize in your program that would have a much greater effect on code speed than this one. Any change that affects the algorithmic complexity of the program will have a much greater effect. More generally speaking, any code change that does not affect algorithmic complexity is probably premature optimization.
If you really want to be sure, measure. On x86 you can use the kind of trick I used in this question to get a fairly accurate measure. The trick is to read a processor register that counts the number of cycles spent. The question also illustrates how code optimization questions can become tricky, even for very simple problems.
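As a rough illustration of the cycle-counter idea, here is a sketch that assumes GCC or Clang on x86, where the __rdtsc() intrinsic from x86intrin.h reads the time-stamp counter. This is not necessarily the exact trick from the linked question. On other targets the sketch falls back to clock(), which is much coarser. The helper names read_ticks and time_calls are invented for this example, and the body of myfunction is only a stand-in for the real 50-line function.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>   /* assumption: GCC/Clang on x86 */
#endif

/* Stand-in for the real function under test. */
void myfunction(long *a, long *b) { *a = *b + 1; }

/* Read a tick counter: the TSC on x86, clock() ticks elsewhere. */
static uint64_t read_ticks(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Call fn `reps` times and return the elapsed ticks. */
uint64_t time_calls(void (*fn)(long *, long *), long *a, long *b, int reps)
{
    assert(reps > 0);
    uint64_t start = read_ticks();
    for (int i = 0; i < reps; i++) {
        fn(a, b);
    }
    return read_ticks() - start;
}
```

A single reading is noisy (interrupts, frequency scaling, migration between cores), so in practice you would repeat the measurement many times and look at the minimum or the median rather than one number.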
I'd assume the compiler will translate the first variant into the second.
The first. Any half-decent compiler will optimize that for you. It's easier to read/understand and easier to write.
Secondly, write first, optimize second. Even if your compiler were completely brain dead, it would at best save you a few nanoseconds or milliseconds on a modern CPU. Chances are there are bigger bottlenecks in your application that could/should be optimized first.
It depends on so many things your best bet is to do it both ways and measure.
It would take less (of your) time to write out the for loop. I'd also say it's clearer to read with the loop. It would probably save a few instructions to write them out, but with modern processors and compilers it may amount to exactly the same result...
The first. It is easier to read.
First, are you sure you have a code execution performance problem? If you don't, then you're talking about making your code less readable and writable for no reason at all.
Second, have you profiled your program to see if this is in a place where it will take a significant amount of time? Humans are very bad at guessing the hot spots in programs, and without profiling you're likely to spend time and effort fiddling with things that don't make a difference.
Third, are you going to check the assembler code produced to see if there's a difference? If you're using an optimizing compiler with optimizations on, it's likely to produce what it sees fit for either. If you aren't, and you have a performance problem, get a better computer or turn on more optimizations.
Fourth, if there is a difference, are you going to test both ways to see which is better? On at least a representative sample of the systems your users will be running on?
And, to give you my best answer to which is more efficient: it depends. If they're in fact compiled to different code, the unrolled version might be faster because it doesn't have the loop overhead (which includes a conditional branch), and the rolled-up version might be faster because it's shorter code and will work better in the instruction cache. The usual wisdom was to unroll, but I once sped up a long-running section by rolling the execution up as tightly as I could.
On modern processors the size of the compiled code becomes very important. If this loop can run entirely from the processor's cache, it will be the fastest solution. As n8wrl said, test it yourself.
I created a short test for this, with surprising results. At least for me, anyway, I would've thought it was the other way round.
So, I wrote two versions of a program iterating over a function nothing(), that did nothing interesting (inc on a variable).
The first used proper loops (a million iterations of 1000 iterations, two nested fors), the second one did a million iterations of 1000 consecutive calls to nothing().
I used the time command to measure. The version with the proper loop took about 3.5 seconds on average, and the consecutive calling version took about 2.5 seconds on average.
I then tried to compile with optimization flags, but gcc detected that the program did essentially nothing and execution was instantaneous on both versions =P. Didn't bother fixing that.
Edit: if you were actually thinking of writing 8 consecutive calls in your code, please don't. Remember the famous quote: "Programs must be written for people to read, and only incidentally for machines to execute.".
Also note that my tests did nothing except nothing() (=P) and are no proper benchmarks to consider in any actual program.
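For anyone who wants to repeat that kind of experiment, here is a scaled-down sketch of the two variants. It is not the original test program: it uses 8 consecutive calls instead of 1000 to stay short, and the names run_looped, run_unrolled and seconds_for are made up. The counter is declared volatile as one possible way to keep the optimizer from removing the work, which is the problem described above.

```c
#include <assert.h>
#include <time.h>

/* volatile so the increments are not optimized away */
static volatile unsigned long counter;

static void nothing(void) { counter++; }

/* Variant 1: inner loop of 8 calls. */
unsigned long run_looped(unsigned long outer)
{
    counter = 0;
    for (unsigned long i = 0; i < outer; i++) {
        for (int j = 0; j < 8; j++) {
            nothing();
        }
    }
    return counter;
}

/* Variant 2: the same 8 calls written out by hand. */
unsigned long run_unrolled(unsigned long outer)
{
    counter = 0;
    for (unsigned long i = 0; i < outer; i++) {
        nothing(); nothing(); nothing(); nothing();
        nothing(); nothing(); nothing(); nothing();
    }
    return counter;
}

/* CPU time in seconds for one variant, measured with clock(). */
double seconds_for(unsigned long (*run)(unsigned long), unsigned long outer)
{
    clock_t start = clock();
    unsigned long calls = run(outer);
    clock_t stop = clock();
    assert(calls == outer * 8);  /* sanity check: the work really happened */
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Compare seconds_for(run_looped, N) against seconds_for(run_unrolled, N) for a large N, at several optimization levels, before drawing any conclusion. As noted above, this is still no proper benchmark for a real program.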
Loop unrolling can make execution faster (otherwise Duff's Device wouldn't have been invented), but that's a function of so many variables (processor, cache size, compiler settings, what myfunction is actually doing, etc.) that you can't rely on it to always be true, or for whatever improvement to be worth the cost in readability and maintainability. The only way to know for sure if it makes a difference for your particular platform is to code up both versions and profile them.
Depending on what myfunction actually does, the difference could be so far down in the noise as to be undetectable.
This kind of micro-optimization should only be done if all of the following are true:
You're failing to meet a hard performance requirement;
You've already picked the proper algorithm and data structure for the problem at hand (e.g., in the average case a poorly optimized Quicksort will beat the pants off of a highly optimized bubble sort, and in the worst case they'll be equally bad);
You're compiling with the highest level of optimization that the compiler offers;
How much does myFunction(long *a, long *b) do?
If it does much more than *a = *b + 1; the cost of calling the function can be so small compared to what goes on inside the function that you are really focussing in the wrong place.
On the other hand, in the overall picture of your application program, what percent of time is spent in these 8 calls? If it's not very much, then it won't make much difference no matter how tightly you optimize it.
As others say, profile, but that's not necessarily as simple as it sounds. Here's the method I and some others use.

C coding practices for performance or code size - beyond what a compiler does

I'm looking to see what can a programmer do in C, that can determine the performance and/or the size of the generated object file.
For example:
1. Declaring simple get/set functions as inline may increase performance (at the cost of a larger footprint)
2. For loops that do not use the value of the loop variable itself, count down to zero instead of counting up to a certain value
etc.
It looks like compilers now have advanced to a level where "simple" tricks (like the two points above) are not required at all. Appropriate options during compilation do the job anyway. Heck, I also saw posts here on how compilers handle recursion - that was very interesting! So what are we left to do at a C level then? :)
My specific environment is: GCC 4.3.3 re-targeted for ARM architecture (v4). But responses on other compilers/processors are also welcome and will be munched upon.
PS: This approach of mine goes against the usual "code first!, then benchmark, and finally optimize" approach.
Edit: Just like it so happens, I found a similar post after posting the question: Should we still be optimizing "in the small"?
One thing I can think of that a compiler probably won't optimize is "cache-friendliness": If you're iterating over a two-dimensional array in row-major order, say, make sure your inner loop runs across the column index to avoid cache thrashing. Having the inner loop run over the wrong index can cause a huge performance hit.
This applies to all programming languages, but if you're programming in C, performance is probably critical to you, so it's especially relevant.
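A small sketch of what that means in C, where arrays are stored in row-major order. The function names and the flat-array layout are assumptions made for this example. Both functions compute the same sum; only the order in which memory is touched differs.

```c
#include <assert.h>
#include <stddef.h>

/* Cache-friendly: the inner loop walks the column index, so consecutive
   iterations touch adjacent memory. */
long sum_row_major(const int *grid, size_t rows, size_t cols)
{
    assert(grid != NULL);
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += grid[r * cols + c];
        }
    }
    return total;
}

/* Cache-unfriendly: the inner loop walks the row index, so consecutive
   iterations are `cols` elements apart in memory. */
long sum_column_major(const int *grid, size_t rows, size_t cols)
{
    assert(grid != NULL);
    long total = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += grid[r * cols + c];
        }
    }
    return total;
}
```

On a small array you will not see a difference; the penalty for the second version tends to show up once the array no longer fits in cache.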
"Always" know the time and space complexity of your algorithms. The compiler will never be able to do that job as well as you can. :)
Compilers these days still aren't very good at vectorizing your code so you'll still want to do the SIMD implementation of most algorithms yourself.
Choosing the right data structures for your exact problem can dramatically increase performance (I've seen cases where moving from a Kd-tree to a BVH did that, for that specific problem).
Compilers might pad some structs/variables to fit into the cache, but other cache optimizations, such as the locality of your data, are still up to you.
Compilers still don't automatically make your code multithreaded, and using OpenMP, in my experience, doesn't really help much. (You really have to understand OpenMP anyway to dramatically increase performance.) So currently, you're on your own doing multithreading.
To add to what Martin says above about cache-friendliness:
Reordering your structures so that fields which are commonly accessed together are in the same cache line can help (for instance by loading just one cache line rather than two). You are essentially increasing the density of useful data in your data cache by doing this. There is a Linux tool which can help you in doing this: dwarves (http://www.linuxinsight.com/files/ols2007/melo-reprint.pdf).
You can use a similar strategy for increasing the density of your code. In gcc you can mark hot and cold branches using likely/unlikely tags. That enables gcc to keep the cold branches separate, which helps in increasing the icache density.
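The likely/unlikely tags are usually spelled as macros around GCC's __builtin_expect (this is the convention the Linux kernel uses; Clang accepts the same builtin). A minimal sketch follows, in which the function sum_valid and the assumption that negative inputs are rare are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang-specific branch hints. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sum the non-negative entries. Negative entries are assumed to be rare
   in this sketch, so that branch is marked as the cold path. */
long sum_valid(const int *values, size_t count, size_t *errors)
{
    assert(values != NULL && errors != NULL);
    long total = 0;
    *errors = 0;
    for (size_t i = 0; i < count; i++) {
        if (unlikely(values[i] < 0)) {
            (*errors)++;   /* cold path */
            continue;
        }
        total += values[i];  /* hot path */
    }
    return total;
}
```

The hint only affects code layout and scheduling, never the result. If the hint is wrong for your data it can make things slower, so it is worth checking against profile data.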
And now for something completely different:
For fields that might be accessed (read and written) across CPUs, the opposite strategy makes sense. The trouble is that for coherence purposes only one CPU at a time can be allowed to write to the same address (in reality, the same cache line). This can lead to a condition called cache-line ping-pong. This is pretty bad and can be worse if that cache line contains other unrelated data. Here, padding the contended data to a cache-line length makes sense.
Note: these clearly are micro-optimizations, to be done only at later stages when you are trying to wring the last bits of performance from your code.
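One way to pad contended data to a cache line, sketched with the C11 _Alignas specifier. The 64-byte line size is an assumption (check your target), and the struct, the array and the bump function are hypothetical names for this example. An older compiler such as GCC 4.3 would need __attribute__((aligned(64))) instead.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; verify for your CPU */

/* Aligning the member forces the whole struct to be CACHE_LINE-aligned
   and padded, so two array elements never share a cache line. */
struct padded_counter {
    _Alignas(CACHE_LINE) unsigned long value;
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "each counter should occupy exactly one cache line");

/* One slot per thread (hypothetical fixed count of 4). */
struct padded_counter counters[4];

void bump(size_t thread_id)
{
    assert(thread_id < 4);
    counters[thread_id].value++;  /* touches only this thread's line */
}
```

The cost is memory: each counter now takes 64 bytes instead of 4 or 8, which is the opposite of the density advice for data that is not contended.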
Precompute where possible (sorry, but it's not always possible; I did extensive precomputation in my chess engine). Store those results in memory, keeping the cache in mind: the bigger the precomputed data is in memory, the lower the chance of a cache hit. Since most recent hardware is multicore, you can design your application to target it.
If you are using several big arrays, make sure you group them close to each other where they will be used, boosting cache hits.
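As a small, hypothetical example of precomputation with the cache in mind (not taken from the chess engine mentioned above): a 256-entry table of bit counts, which occupies only a few cache lines, used to count the set bits of a 64-bit board mask one byte at a time.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t bit_count[256];  /* 256 bytes: small enough to stay cached */
static int table_ready = 0;

/* Fill the table once at startup. */
void init_bit_count(void)
{
    for (int v = 0; v < 256; v++) {
        int bits = 0;
        for (int x = v; x != 0; x >>= 1) {
            bits += x & 1;
        }
        bit_count[v] = (uint8_t)bits;
    }
    table_ready = 1;
}

/* Count set bits using eight table lookups instead of a 64-step loop. */
int popcount64(uint64_t board)
{
    assert(table_ready);
    int total = 0;
    for (int i = 0; i < 8; i++) {
        total += bit_count[board & 0xFF];
        board >>= 8;
    }
    return total;
}
```

A 65536-entry table would halve the number of lookups but is far more likely to miss in cache, which is the trade-off described above.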
Many people are not aware of this: Define an inline label (varies by compiler) which means inline, in its intent - many compilers place the keyword in an entirely different context from the original meaning. There are also ways to increase the inline size limits, before the compiler begins popping trivial things out of line. Human directed inlining can produce much faster code (compilers are often conservative, or do not account for enough of the program), but you need to learn to use it correctly, because it can (easily) be counterproductive. And yes, this absolutely applies to code size as well as speed.
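A sketch of such an inline label, assuming the usual compiler-specific spellings: __attribute__((always_inline)) on GCC/Clang and __forceinline on MSVC. The macro name FORCE_INLINE and the struct point accessors are made up for this example.

```c
#include <assert.h>

/* Pick the strongest "really inline this" spelling the compiler offers. */
#if defined(_MSC_VER)
#define FORCE_INLINE static __forceinline
#elif defined(__GNUC__)
#define FORCE_INLINE static inline __attribute__((always_inline))
#else
#define FORCE_INLINE static inline  /* plain hint as a fallback */
#endif

struct point {
    long x;
    long y;
};

/* Trivial accessors of the kind mentioned in the question. */
FORCE_INLINE long point_get_x(const struct point *p)
{
    assert(p != 0);
    return p->x;
}

FORCE_INLINE void point_set_x(struct point *p, long x)
{
    assert(p != 0);
    p->x = x;
}
```

Forcing inlining of large functions can bloat the code and hurt the instruction cache, so this is best reserved for small, hot functions and checked with a profiler.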

Resources