Array access/write performance differences?

Array access/write performance differences? - arrays

This is probably going to language dependent, but in general, what is the performance difference between accessing and writing to an array?
For example, if I am trying to write a prime sieve and am representing the primes as a boolean array.
Upon finding a prime, I can say
for(int i = 2; n * i < end; i++)
{
prime[n * i] = false;
}
or
for(int i = 2; n * i < end; i++)
{
if(prime[n * i])
{
prime[n * i] = false;
}
}
The intent in the latter case is to check the value before writing it to avoid having to rewrite many values that have already been checked. Is there any realistic gain in performance here, or are access and write mostly equivalent in speed?

Impossible to answer such a generic question without the specifics of the machine/OS this is running on, but in general the latter is going to be slower because:
The second example you have to get the value from RAM to L2/L1 cache and read it to a register, make a chance on the value and write it back. In the first case you might very well get away with simply writing a value to the L1/L2 caches. It can written to RAM from the caches later while your program is doing something else.
The second form has much more code to execute per iteration. For large enough number of iterations, the difference gets big real fast.

In general this depends much more on the machine than the programing language. The writes often will take a few more clock cycles because, depending on the machine, more cache values need to be updated in memory.
However, your second segment of code will be WAY slower, and it's not just because there's "more code". The big reason is that anytime you use an if-statement on most machines the CPU uses a branch predictor. The CPU literally predicts which way the if-statement will run ahead of time, and if it's wrong it has to backtrack. See http://en.wikipedia.org/wiki/Pipeline_%28computing%29 and http://en.wikipedia.org/wiki/Branch_predictor to understand why.
If you want to do some optimization, I would recommend the following:
Profile! See what's really taking up time.
Multiplication is much harder than addition. Try rewriting the loop so that i += n, and use this for your array index.
The loop condition "should" be totally reevaluated at every iteration unless the compiler optimizes it away. So try avoiding multiplication in there.
Use -O2 or -O3 as a compiler option
You might find that some values of n are faster than others because of cache locality. You might think of some clever ways to rewrite your code to take advantage of this.
Disassemble the code and look at what it's actually doing on your processor

It's a hard question and it heavily depends on your hardware, OS and complier. But for sake of theory, you should consider two things: branching and memory access. As branching is generally evil, you want to avoid it. I wouldn't even surprise if some compiler optimization took place and your second snippet would be reduced to the first one (compilers love avoiding branches, they probably consider it as a hobby, but they have a reason). So in these terms the first example is much cleaner and easier to deal with.
There're also CPU caches and other memory related issues. I believe that in both examples you have to actually load the memory into the CPU cache, so you can either read it or update. While reading is not a problem, writing have to propagate the changes up. I wouldn't be worried if you use the function in a single thread (as #gby pointed out, OS can push the changes a little bit later).
There is only one scenario I can come up with, that would make me consider solution from your second example. If I shared the table between threads to work on it in parallel (without locking) and had separate caches for different CPUs. Then, every time you amend the cache line from one thread, the other thread have to update it's copy before reading or writing to the same memory block. It's known as a cache coherence and it actually may hurt your performance badly; in such a case I could consider conditional writes. But wait, it's probably far away from your question...

Related

Idiomatic way of performance evaluation?

I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
doSomething()
drawSomething()
doSomething2()
sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval for every entrance and exit of each procedure,
It would incur huge performance overhead.
I am curious what is an idiomatic way of measuring the performance.
Printing of logging is good enough?

Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that; for very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but different throughput if run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.

Is there "compiler-friendly" code / convention [duplicate]

Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler, that maybe it would be a good idea to keep this variable in an internal register. They also made the tertiary operator to help generate better code.
As time passed, the compilers matured. They became very smart in that their flow analysis allowing them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?
[Edit] Offset related link

Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo, i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo, i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

The order you traverse memory can have profound impacts on performance and compilers aren't really good at figuring that out and fixing it. You have to be conscientious of cache locality concerns when you write code if you care about performance. For example two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column major format will tend to make you have more cache misses and make your program more memory bound than processor bound:
#define N 1000000;
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}

Generic Optimizations
Here as some of my favorite optimizations. I have actually increased execution times and reduced program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
size_t i = 0;
**const size_t STATEMENTS_PER_LOOP = 8;**
for (i = 0; i < BYTES_TO_CHECKSUM; **i = i / STATEMENTS_PER_LOOP**)
{
sum += *buffer++; // 1
sum += *buffer++; // 2
sum += *buffer++; // 3
sum += *buffer++; // 4
sum += *buffer++; // 5
sum += *buffer++; // 6
sum += *buffer++; // 7
sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
In this advantage, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
Reduction of if statements
Processors hate branches, or jumps, since it forces the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
unsigned int origin_x;
unsigned int origin_y;
unsigned int height;
unsigned int width;
bool write(Reader_Interface * p_reader)
{
bool status = false;
if (p_reader)
{
status = p_reader->write(origin_x);
status = status && p_reader->write(origin_y);
status = status && p_reader->write(height);
status = status && p_reader->write(width);
}
return status;
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octect at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.

The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, you shouldn't inner-loop on a small function from one file in another file, as that stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.

On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting an alias keyword on the function parameters can really help the compiler generate faster code. However if the memory regions do overlap and you used 'alias', you're in for a good debugging session of undefined behaviors!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.

The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.

use const correctness as much as possible in your code. It allows the compiler to optimize much better.
In this document are loads of other optimization tips: CPP optimizations (a bit old document though)
highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc

Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to to do their optimizations because it's easier to work with. By doing this places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as good as the best register allocators, and allows you to debug more easily because you almost never have to wonder where a variable got it's value from as there was only one place it was assigned.
Avoid global variables.
When working with data by reference or pointer pull that into local variables, do your work, and then copy it back. (unless you have a good reason not to)
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
unsigned int bar = foo-FOO_MIN;
int rc = ((FOO_MAX-FOO_MIN) < bar) ? 1 : 0;
return rc;
}
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping
up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in c (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to destination and just grab it back from the return of these functions.
Never overlook the oppurtunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof(foo));
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into registers or stack positions that called functions usually expect their parameters to be in (it has to deviate in both the called function and the location of all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benifit of inlining code, but without actually inlining.

I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the paramters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.

In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow in some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean but after that device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worse case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices ) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.

A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.

Two coding technics I didn't saw in the above list:
Bypass linker by writing code as an unique source
While separate compilation is really nice for compiling time, it is very bad when you speak of optimization. Basically the compiler can't optimize beyond compilation unit, that is linker reserved domain.
But if you design well your program you can can also compile it through an unique common source. That is instead of compiling unit1.c and unit2.c then link both objects, compile all.c that merely #include unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
It's very like writing headers only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware it change part of C semantic and you can meet some problems like static variables or macro collision. For most programs it's easy enough to overcome the small problems that occurs. Also be aware that compiling as an unique source is way slower and may takes huge amount of memory (usually not a problem with modern systems).
Using this simple technique I happened to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through linker begin to be supported by compilers gcc: Link time optimization.
Separate atomic tasks in loops
This one is more tricky. It's about interaction between algorithm design and the way optimizer manage cache and register allocation. Quite often programs have to loop over some data structure and for each item perform some actions. Quite often the actions performed can be splitted between two logically independent tasks. If that is the case you can write exactly the same program with two loops on the same boundary performing exactly one task. In some case writing it this way can be faster than the unique loop (details are more complex, but an explanation can be that with the simple task case all variables can be kept in processor registers and with the more complex one it's not possible and some registers must be written to memory and read back later and the cost is higher than additional flow control).
Be careful with this one (profile performances using this trick or not) as like using register it may as well give lesser performances than improved ones.

I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.

Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
Late Edit
While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.

Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, knows that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created, if it hasn't, create it, then return it). The more complicated this chain of getters becomes, the more wasted time we'll have.

Use the most local scope possible for all variable declarations.
Use const whenever possible
Dont use register unless you plan to profile both with and without it
The first 2 of these, especially #1 one help the optimizer analyze the code. It will especially help it to make good choices about what variables to keep in registers.
Blindly using the register keyword is as likely to help as hurt your optimization, It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.

Align your data to native/natural boundaries.

I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.

A neat technique I learned from #MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(condition)
swap(a,b);
return a;

If you've got small functions you call repeatedly, i have in the past got large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.

Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1

When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.

For performance, focus first on writing maintenable code - componentized, loosely coupled, etc, so when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.
Optimizer will help your program's performance marginally.

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been
written correctly, compiled with full
optimization, tested and put into
production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

i use intel compiler. on both Windows and Linux.
when more or less done i profile the code. then hang on the hotspots and trying to change the code to allow compiler make a better job.
if a code is a computational one and contain a lot of loops - vectorization report in intel compiler is very helpful - look for 'vec-report' in help.
so the main idea - polish the performance critical code. as for the rest - priority to be correct and maintainable - short functions, clear code that could be understood 1 year later.

One optimization i have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated the associated construction costs.
As i generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.

One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.

I have long suspected, but never proved that declaring arrays so that they hold a power of 2, as the number of elements, enables the optimizer to do a strength reduction by replacing a multiply by a shift by a number of bits, when looking up individual elements.

Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.

Counting FLOPs and size of data and check whether function is memory-bound or cpu-bound

I am going to analyse and optimize some C-Code and therefore I first have to check, whether the functions I want to optimize are memory-bound or cpu-bound. In general I know, how to do this, but I have some questions about counting Floating Point Operations and analysing the size of data, which is used. Look at the following for-loop, which I want to analyse. The values of the array are doubles (that means 8 Byte each):
for(int j=0 ;j<N;j++){
for(int i=1 ;i<Nt;i++){
matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i]*sigma;
}
}
1) How many floating point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations within the arrays, too (matrix[j*Nt+i], which are 2 more FLOP for this array)?
2)How much data is transfered? 2* ((Nt-1)*N)8Byte or 3 ((Nt-1)*N)*8Byte. I mean, every entry of the matrix has to be loaded. After the calculation, the new values is saved to that index of the array (now these is 1load and 1 store). But this value is used for the next calculation. Is another load operations needed therefore, or is this value (matrix[j*Nt+i-1]) already available without a load operation?
Thx a lot!!!

With this type of code, the direct sort of analysis you are proposing to do can be almost completely misleading. The only meaningful information about the performance of the code is actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like you ask about 2*... or 3*...) are completely moot because they vary in practice depending on lots of details.

Array access taking O(1) time improvable?

I've been reading a book assigned for class and it mentions that array access takes O(1) time. I realize that this is very fast (maybe as fast as possible), but if you have a loop that has to refer to this a few times, is there any advantage to assigning a temporary variable to have the value looked up in the array? Or would using the temporary variable still be O(1) to use as well?
I'm assuming this question is language independent. Also I realize that even if the answer is yes that the advantage is tiny, I'm just curious.

Note that O(1) doesn't mean "instantaneous." It just means "at most some constant." This means that both 1 and 101000 are both O(1), even though the second of these is bigger than the number of atoms in the universe.
If you are repeatedly accessing the same array element multiple times, it will take O(1) time for each access. Storing that array element in a local variable also gives O(1) lookup time, but the constants might not be the same. It might be better to pick one option over the other, but you'd really have to profile the program to be sure.
In practice, this sort of microoptimization is unlikely to have a measurable effect on program time unless the code you're running accounts for a huge fraction of the program's running time. I would be shocked to find an example where this change would make a noticeable impact in any real code.
Modern architectures probably might make this change a bit faster, but not dramatically so. If you keep accessing the same array element multiple times, the processor will probably keep that part of the array in cache, making lookups really fast. Also, a good optimizing compiler might already turn the non-local-copy code into the local copy code for you.
Hope this helps!

If I understand, you're asking if
for (int i=0; i<len; i++) {
int temp = ar[i];
foo += temp;
bar -= temp;
}
is any better than:
for (int i=0; i<len; i++) {
foo += ar[i];
bar -= ar[i];
}
I wouldn't worry about it:
If the code in the body of your loop is going to access the same array entry, say ar[i] multiple times, any halfway decent compiler (at a nonzero optimization level) will keep that value in a register for quick re-use. In other words, the compiler will probably generate the exact same assembly given the either of the above code samples.
Note that either of these is still O(1) (accessing one thing one time). Don't confuse big-O notation of algorithms with instruction-level optimizations.
Edit
I just compiled a sample program with two functions, containing the above two samples, and at -O2 gcc 4.7.2 generated the exact same machine code; byte-for-byte.

The only way you can perform better than O(1) time is to not have to do anything in the first place. That would be O(0) time.
Or with fewer words: No.

There are already things built into modern CPU hardware (cache lines for example) that do something like what you describe but better in a way that a temporary variable cannot do. Even better than that, no source modification is needed.

No. Array access is not some magical zero-footprint thing made out of sparkles and love. The algorithm to determine address from array indices in C can be seen here. The more dimensions you have on your array, the slower it gets to access, as additional operations (primarily muls and conditionals, in terms of cost) are required to arrive at the final, 1D memory address. Even if your array has just one dimension, you still have to calculate the offset on the base address, which is a single add operation, hence O(1).

Can I use Duff's Device on an array in C?

I have a loop here and I want to make it run faster. I am passing in a large array. I recently heard of Duff's Device can it be applied to this for loop? any ideas?
for (i = 0; i < dim; i++) {
for (j = 0; j < dim; j++) {
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
}

Please, please don't use Duff's device. A thousand maintenance programmers will thank you. I used to work for a training company where someone thought it funny to introduce the device in the first ten pages of their C programming course. As an instructor it was impossible to deal with, unless (as the guy that that wrote that bit of the course apparently did) you believe in "kewl" coding.
Needless to say, I had the thing expunged from the course, ASAP.

Why do you want to make it run faster?
Is there an actual performance problem?
If so, have you profiled and found that this is executing often enough, and hence worth optimizing?
If so, you may want to write it in two ways, the straightforward way you have now and with Duff's Device, or any other method you like.
At that point, you test the performance. You may be surprised. Modern optimizers are quite good, and modern CPUs are really complicated, so source-level optimization is often counterproductive. (I once did this in a loop that was taking a whole lot of time, and found that tightening up the loop, even while introducing some indirection, improved performance. Your mileage is almost certainly going to vary.)
Finally, if Duff's Device is indeed faster, you have to decide whether the performance improvement is worth taking this straightforward and optimizable code and substituting a maintenance problem that may not improve performance at all in the next compiler version.

You should never unroll loops by hand. It would only give you a very platform-specific advantage, if any. All good compilers can unroll loops, but it's not even guaranteed to make the code faster, because it takes up more memory bandwidth to read a longer program from main memory.
If you want the loop to run fast, you should make sure that whatever RIDX computes, dst is accessed sequentially, so you minimize the number of cache misses. Other than that I can't see how you could make the loop faster.

Duff's Device is simply a technique for loop unrolling. And since any loop can be unrolled, you can use Duff's Device.

Were you able to figure this out and get a gain it would be a pittance and would in no way justify the complexity.
You would be better served spending your energies a level up--reconsidering your entire solution. Perhaps rather than copying values you could create a translation array and spend a little more time looking up answers indirectly when you need them (not really a good idea for building images--just trying to give you a different way to look at it).
Or maybe there is some completely different approach--look at your entire problem and try completely throwing away your current approaches and concepts and just see if there is something you haven't considered because you are too tied to this implementation.
Could your graphics card do some of this work?
Rethinking the problem at a high level works a lot more often than you might think.
Edit:
Looking at your sample more, it looks like you are taking a block of your image and copying it, pixel for pixel, to another image. If so, there are almost certainly ways to do it getting rid of the macro and copying byte for byte instead, or even using a block move assembly function then tweaking the edges of the result to match.
Or I may have guessed wrong, but chances are that looking at it on a larger scale than pixel for pixel might help you a lot more than unrolling loops.

The number of instruction cycles to implement the statement
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
will far outweigh the loop overhead, so unrolling the loop will be very little help on a percentage basis.

I believe this is a candidate for Duff's Device, depending on what the RIDX() function does. But I hope you don't expect someone to write the code for you... Also, you might want to format your code properly so it's actually readable.

Probably, as long as dim is a power of 2 or you have fast modulus on your target system. Learned something new today. I independently discovered that construct 5 years back and dropped it into our memCopy() routine. Who knew :)

Pedantically, no. Duff's Device was for writing to a hardware register (thus the target of the copy was always the same address).
You can implement something very much like Duff's Device for a copy like this, but there will be a clarity and maintenance cost. I'd first profile to make sure it's a problem. I'd also look into whether you can simplify the indexing, as that may enable the compiler to do the dirty work of unrolling the loop.

If you use it, make sure you measure it to determine that the improvement is both real , significant, and necessary in terms of your performance requirements. I doubt it will be.
For large loops, the remainder dealt with by Duff's device will be an insignificant proportion of the operation, and for small loops where the remainder is significant you will only see a benefit if you have many such loops (themselves in a loop), because small loops by definition don't take that long! Even then the compiler's optimiser is likely to do as well or better without rendering your code unreadable. It is also possible that the application of Duff's device will prevent the optimiser from applying more perhaps effective optimisations, which is why if you use it you need to measure it.
All the time you are likely to save on this (if any) you have probably wasted several times over reading responses to this question.

Duff's device may not be the optimized solution in an unrolled loop.
I had a function that sent a bit to a port, followed by a clock pulse to another port. For each bit, the functions were:
if (bit == 1)
{
write to the set port.
}
else
{
write to the clear port.
}
write high clock bit.
write low clock bit.
This was put into a Duff's device loop, along with bit shifting and bit count incrementing.
I improved the efficiency of the loop by using nibble values instead of bits (a nibble being 4 bits). The switch statement was based on the nibble value. This allowed 4 bits to be processed without any if statements, improving the flow through instruction cache (pipeline).
There are times when Duff's device may not be the optimal solution; but can be the foundation for a more efficient solution.

Modern compilers already do loop unrolling for you when optimizations are turned on, which renders Duff's device obsolete. The compiler knows better than you do the optimal level of unrolling for your compilation target, and you don't have to write any extra code to do it. It was a neat hack at the time, but these days Duff's device is just a historical curiosity, not a good programming practice.

In the end whoever makes the call on optimization everyone involved needs to be sure it is well documented and written in style that is as self documenting as possible using correctly spelled meaningful names for variables, functions etc. So it is obvious if the comments and the code get out of sync.
The need for optimization will never end. I was talking with a grad student that had broken malloc()/free() working on the largest file of genetic data ever attempted in one pass. After while the heap became too fragmented for malloc to to find a block of contiguous RAM to allocate to the calling function. He had to switch to a library that malloc only issued blocks of memory on 32k boundaries. It took 160% more the memory the old library, ran a good slower but it finished the job.
You must be careful using Duff's Device and many other optimizations to be sure the compiler does't optimize your optimization into obscure broken object code. As we enter an environment using automatic parallelizing tools this will become more of a problem.
I expect the lower the level the optimization the more likely future optimizations are to break the code. I can see that my habit of discarding line feeds in code designed to run on multiple platforms and putting the line feed back in in the print and write functions on each platform will run into problems in several of the things discussed in this thread.
-gcouger

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight