Computation Speed of Functions

Computation Speed of Functions - c

I have a quick question about optimizing the speed of my code.
Is it faster to return a double in a function or to have a pointer to your double as one of your arguments?
Returning double
double multiply(double x,double y) {
return x * y;
}
Pointer argument
void multiply(double x,double y,double *xy) {
*xy = x * y;
}

The sort of decision you're talking about here is often called a microoptimization - you're optimizing the code by making a tiny change rather than rethinking the overall strategy of the program. Typically, "microoptimization" has the connotation of "rewriting something in a less obvious way with the intent of squeezing out a little bit more performance." The language's explicit way of communicating data out of a function is through return values and programmers are used to seeing that for primitive types, so if you're going to go "off script" and use an outparameter like this, there should be a good reason to do so.
Are you absolutely certain that this code is executed so frequently that it's worth considering a rewrite that makes it harder to read in the interest of efficiency? A good optimizing compiler is likely to inline this function call anyway, so chances are there's not much of a cost at all. To make sure it's worth actually rewriting the code, you should start off by running a profiler and getting hard evidence that this particular code really is a bottleneck.
Once you've gathered evidence that this actually is slowing your program down, take a step and ask - why are you doing so many multiplications? Rather than doing a microoptimization, you might ask whether there's a different algorithm you could use that requires fewer multiplications. Changes like those are more likely to lead to performance increases.
Finally, if you're sure this is the bottleneck, and you're sure that there is no way to get rid of the multiplications, then you should ask whether this change would do anything. And for that, the only real way to find out would be to make the change and measure the difference. This change is so minor that most good compilers would be smart enough to realize what you're doing and optimize accordingly. As a result, I'd be astonished if you actually saw a performance increase here.
To summarize:
Readability is so much more valuable than efficiency that, as a starting point, it's worth using the "right" language feature for the job. Just use a return statement initially.
If you have hard, concrete evidence that the function you have is a bottleneck, see whether you actually need to call the function that many times by looking for bigger-picture ways of optimizing the code.
If you absolutely need to call the function that many times, then make the change and see if it works - but don't expect that it's actually going to make a difference because the change is minor and optimizing compilers are smart.

Related

Is there "compiler-friendly" code / convention [duplicate]

Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler, that maybe it would be a good idea to keep this variable in an internal register. They also made the tertiary operator to help generate better code.
As time passed, the compilers matured. They became very smart in that their flow analysis allowing them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?
[Edit] Offset related link

Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.

Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo, i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo, i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).

The order you traverse memory can have profound impacts on performance and compilers aren't really good at figuring that out and fixing it. You have to be conscientious of cache locality concerns when you write code if you care about performance. For example two-dimensional arrays in C are allocated in row-major format. Traversing arrays in column major format will tend to make you have more cache misses and make your program more memory bound than processor bound:
#define N 1000000;
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}

Generic Optimizations
Here as some of my favorite optimizations. I have actually increased execution times and reduced program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method has fewer statements in its content than the combined overhead. These are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
size_t i = 0;
**const size_t STATEMENTS_PER_LOOP = 8;**
for (i = 0; i < BYTES_TO_CHECKSUM; **i = i / STATEMENTS_PER_LOOP**)
{
sum += *buffer++; // 1
sum += *buffer++; // 2
sum += *buffer++; // 3
sum += *buffer++; // 4
sum += *buffer++; // 5
sum += *buffer++; // 6
sum += *buffer++; // 7
sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
In this advantage, a secondary benefit is gained: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
Reduction of if statements
Processors hate branches, or jumps, since it forces the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
unsigned int origin_x;
unsigned int origin_y;
unsigned int height;
unsigned int width;
bool write(Reader_Interface * p_reader)
{
bool status = false;
if (p_reader)
{
status = p_reader->write(origin_x);
status = status && p_reader->write(origin_y);
status = status && p_reader->write(height);
status = status && p_reader->write(width);
}
return status;
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octect at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.

The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, you shouldn't inner-loop on a small function from one file in another file, as that stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.

On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting an alias keyword on the function parameters can really help the compiler generate faster code. However if the memory regions do overlap and you used 'alias', you're in for a good debugging session of undefined behaviors!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.

The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.

use const correctness as much as possible in your code. It allows the compiler to optimize much better.
In this document are loads of other optimization tips: CPP optimizations (a bit old document though)
highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc

Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to to do their optimizations because it's easier to work with. By doing this places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as good as the best register allocators, and allows you to debug more easily because you almost never have to wonder where a variable got it's value from as there was only one place it was assigned.
Avoid global variables.
When working with data by reference or pointer pull that into local variables, do your work, and then copy it back. (unless you have a good reason not to)
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
unsigned int bar = foo-FOO_MIN;
int rc = ((FOO_MAX-FOO_MIN) < bar) ? 1 : 0;
return rc;
}
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping
up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in c (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to destination and just grab it back from the return of these functions.
Never overlook the oppurtunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof(foo));
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into registers or stack positions that called functions usually expect their parameters to be in (it has to deviate in both the called function and the location of all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benifit of inlining code, but without actually inlining.

I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the paramters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop, the loop will not be recognized anymore by most compilers and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch is faster than doing a test like if(a || b || ... || z). Check first if your compiler does this automatically, some do and it's more readable to have the if though.

In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow in some platforms (e.g., vxworks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean but after that device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worse case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices ) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.

A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.

Two coding technics I didn't saw in the above list:
Bypass linker by writing code as an unique source
While separate compilation is really nice for compiling time, it is very bad when you speak of optimization. Basically the compiler can't optimize beyond compilation unit, that is linker reserved domain.
But if you design well your program you can can also compile it through an unique common source. That is instead of compiling unit1.c and unit2.c then link both objects, compile all.c that merely #include unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
It's very like writing headers only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware it change part of C semantic and you can meet some problems like static variables or macro collision. For most programs it's easy enough to overcome the small problems that occurs. Also be aware that compiling as an unique source is way slower and may takes huge amount of memory (usually not a problem with modern systems).
Using this simple technique I happened to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through linker begin to be supported by compilers gcc: Link time optimization.
Separate atomic tasks in loops
This one is more tricky. It's about interaction between algorithm design and the way optimizer manage cache and register allocation. Quite often programs have to loop over some data structure and for each item perform some actions. Quite often the actions performed can be splitted between two logically independent tasks. If that is the case you can write exactly the same program with two loops on the same boundary performing exactly one task. In some case writing it this way can be faster than the unique loop (details are more complex, but an explanation can be that with the simple task case all variables can be kept in processor registers and with the more complex one it's not possible and some registers must be written to memory and read back later and the cost is higher than additional flow control).
Be careful with this one (profile performances using this trick or not) as like using register it may as well give lesser performances than improved ones.

I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.

Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
Late Edit
While I have no direct knowledge of the code; it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.

Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, knows that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created, if it hasn't, create it, then return it). The more complicated this chain of getters becomes, the more wasted time we'll have.

Use the most local scope possible for all variable declarations.
Use const whenever possible
Dont use register unless you plan to profile both with and without it
The first 2 of these, especially #1 one help the optimizer analyze the code. It will especially help it to make good choices about what variables to keep in registers.
Blindly using the register keyword is as likely to help as hurt your optimization, It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.

Align your data to native/natural boundaries.

I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.

A neat technique I learned from #MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(condition)
swap(a,b);
return a;

If you've got small functions you call repeatedly, i have in the past got large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.

Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1

When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.

For performance, focus first on writing maintenable code - componentized, loosely coupled, etc, so when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.
Optimizer will help your program's performance marginally.

You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been
written correctly, compiled with full
optimization, tested and put into
production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.

i use intel compiler. on both Windows and Linux.
when more or less done i profile the code. then hang on the hotspots and trying to change the code to allow compiler make a better job.
if a code is a computational one and contain a lot of loops - vectorization report in intel compiler is very helpful - look for 'vec-report' in help.
so the main idea - polish the performance critical code. as for the rest - priority to be correct and maintainable - short functions, clear code that could be understood 1 year later.

One optimization i have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults associated the associated construction costs.
As i generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.

One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.

I have long suspected, but never proved that declaring arrays so that they hold a power of 2, as the number of elements, enables the optimizer to do a strength reduction by replacing a multiply by a shift by a number of bits, when looking up individual elements.

Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.

Is it good to use functions as much as possible?

When I read open source codes (Linux C codes), I see a lot functions are used instead of performing all operations on the main(), for example:
int main(void ){
function1();
return 0;
}
void function() {
// do something
function2();
}
void function2(){
function3();
//do something
function4();
}
void function3(){
//do something
}
void function4(){
//do something
}
Could you tell me what are the pros and cons of using functions as much as possible?
easy to add/remove functions (or new operations)
readability of the code
source efficiency(?) as the variables in the functions will be destroyed (unless dynamic allocation is done)
would the nested function slow the code flow?

Easy to add/remove functions (or new operations)
Definitely - it's also easy to see where does the context for an operation start/finish. It's much easier to see that way than by some arbitrary range of lines in the source.
Readability of the code
You can overdo it. There are cases where having a function or not having it does not make a difference in linecount, but does in readability - and it depends on a person whether it's positive or not.
For example, if you did lots of set-bit operations, would you make:
some_variable = some_variable | (1 << bit_position)
a function? Would it help?
Source efficiency(?) due to the variables in the functions being destroyed (unless dynamic allocation is done)
If the source is reasonable (as in, you're not reusing variable names past their real context), then it shouldn't matter. Compiler should know exactly where the value usage stops and where it can be ignored / destroyed.
Would the nested function slow the code flow?
In some cases where address aliasing cannot be properly determined it could. But it shouldn't matter in practice in most programs. By the time it starts to matter, you're probably going to be going through your application with a profiler and spotting problematic hotspots anyway.
Compilers are quite good these days at inlining functions though. You can trust them to do at least a decent job at getting rid of all cases where calling overhead is comparable to function length itself. (and many other cases)

This practice of using functions is really important as the amount of code you write increases. This practice of separating out to functions improves code hygiene and makes it easier to read. I read somewhere that there really is no point of code if it is only readable by you only (in some situations that is okay I'm assuming). If you want your code to live on, it must be maintainable and maintainability is one created by creating functions in the simplest sense possible. Also imagine where your code-base exceeds well over 100k lines. This is quite common and imagine having that all in the main function. That would be an absolute nightmare to maintain. Dividing the code into function helps create degrees of separability so many developers can work on different parts of the code-base. So basically short answer is yes, it is good to use functions when necessary.

Functions should help you structure your code. The basic idea is that when you identify some place in the code which does something that can be described in a coherent, self-contained way, you should think about putting it into a function.
Pros:
Code reuse. If you do many times some sequence of operations, why don't you write it once, use it many times?
Readability: it's much easier to understand strlen(st) than while (st[i++] != 0);
Correctness: the code in the previous line is actually buggy. If it is scattered around, you may probably not even see this bug, and if you will fix it in one place, the bug will stay somewhere else. But given this code inside a function named strlen, you will know what it should do, and you can fix it once.
Efficiency: sometimes, in certain situations, compilers may do a better job when compiling a code inside a function. You probably won't know it in advance, though.
Cons:
Splitting a code into functions just because it is A Good Thing is not a good idea. If you find it hard to give the function a good name (in your mother language, not only in C) it is suspicious. doThisAndThat() is probably two functions, not one. part1() is simply wrong.
Function call may cost you in execution time and stack memory. This is not as severe as it sounds, most of the time you should not care about it, but it's there.
When abused, it may lead to many functions doing partial work and delegating other parts from here to there. too many arguments may impede readability too.
There are basically two types of functions: functions that do a sequence of operations (these are called "procedures" in some contexts), and functions that does some form of calculation. These two types are often mixed in a single function, but it helps to remember this distinction.
There is another distinction between kinds of functions: Those that keep state (like strtok), those that may have side effects (like printf), and those that are "pure" (like sin). Function like strtok are essentially a special kind of a different construct, called Object in Object Oriented Programming.

You should use functions that perform one logical task each, at a level of abstraction that makes the function of each function easy to logically verify. For instance:
void create_ui() {
create_window();
show_window();
}
void create_window() {
create_border();
create_menu_bar();
create_body();
}
void create_menu_bar() {
for(int i = 0; i < N_MENUS; i++) {
create_menu(menus[i]);
}
assemble_menus();
}
void create_menu(arg) {
...
}
Now, as far as creating a UI is concerned, this isn't quite the way one would do it (you would probably want to pass in and return various components), but the logical structure is what I'm trying to emphasize. Break your task down into a few subtasks, and make each subtask its own function.
Don't try to avoid functions for optimization. If it's reasonable to do so, the compiler will inline them for you; if not, the overhead is still quite minimal. The gain in readability you get from this is a great deal more important than any speed you might get from putting everything in a monolithic function.
As for your title question, "as much as possible," no. Within reason, enough to see what each function does at a comfortable level of abstraction, no less and no more.

One condition you can use: if part of the code will be reuse/rewritten, then put it in a function.

I guess I think of functions like legos. You have hundreds of small pieces that you can put together into a whole. As a result of all of those well designed generic, small pieces you can make anything. If you had a single lego that looked like an entire house you couldn't then use it to build a plane, or train. Similarly, one huge piece of code is not so useful.
Functions are your bricks that you use when you design your project. Well chosen separation of functionality into small, easily testable, self contained "functions" makes building and looking after your whole project easy. Their benefits WAYYYYYYY out-weigh any possible efficiency issues you may think are there.
To be honest, the art of coding any sizeable project is in how you break it down into smaller pieces, so functions are key to that.

condition vs division

given the following statement which is executed a lot:
iNormVal = iVal / uRatio;
would the following make more sense (performance wise) if uRatio == 1 most (90%) of the time?
if(uRatio > 1)
iNormVal = iVal / uRatio;
else
iNormVal = iVal;
thanks..

Since you spotted this as a potential bottleneck, it's very likely this spot is totally irrelevant for your app's overall speed. Seriously, humans, even guru programmers, are notoriously bad at spotting real bottlenecks. (The difference is that good programmers admit and preach that, while juniors keep spending time for optimizing irrelevant spots.)
Generally I found this approach to optimizations most helpful:
If speed is a major concern, schedule considerable time for optimizations to be done before releasing your app.
Design your code so that it doesn't sport inherent pessimizations.
Implement it the way it's easiest to understand the code. Prevent obvious pessimizations (like passing parameters by value instead of reference), but don't get overexcited.
Check if it is too slow. If so profile the app and identify the hot spots.
Put resources into optimizing (and thereby potentially obfuscating) those hot spots only, iteratively profiling to check which changes help.
Stop when the app is fast enough.
(It's different for library code, obviously, but these few steps would carry you a long way.)

You need to profile this to get a measurement, it's too hard to guess. The compiler might decide you're wrong and remove the test, so check with and without optimizing.
The actual cost of an (integer) division might be rather low, especially on modern desktop-class processors. According to this PDF, the costs on modern (Wolfdale/Nehalem/Sandy Bridge) of a 32/32-bit division are 14-23/17-28/20-28 cycles respectively. So, if you really do this a lot, it might add up. In that case, look into parallel (vectorized) options if possible.
I would try to avoid it if at all possible, since it introduces a branch. Branches have two disadvantages: they make the code more complex by introducing multiple paths that the programmer reading the code has to understand, and they can also introduce execution overhead.

It depends.
Is the code in a performance critical application? If so then it may help perf wise. If not well then I would usually err on the side or readibility and not introduce the extra if statement.
Even if it is in a performance critical application, it is usually the external boundary interactions that account for 95% of the perf time such as interactions with databases or external services. Compilers usually execute very quickly and if statements are very cheap. When we usually profile our code, it is rare that we would make a change such as what you have described for perf reasons only. Misuse of looping and the like may sometimes prop up but we rarely add if statements like described.
Hope this helps...

If you decide to go with branching then you could check first for the common case. It is slightly more readable and should be slightly better performance wise.
if(uRatio <= 1) {
iNormVal = iVal;
}
else {
iNormVal = iVal / uRatio;
}
To be more readable you could add a local variable with a good name that holds the result of the expression.
unsigned int uSmallRatio = uRatio <= 1;
if(uSmallRatio) {
iNormVal = iVal;
}
else {
iNormVal = iVal / uRatio;
}
The compiler could optimize this into the same machine code as the first approach. I'm not sure about this though.
Similarly you could do this but it is not pretty:
iNormVal = uRatio <= 1 ? iVal : iVal / uRatio;
Finally another approach would be:
iNormVal = iVal;
if(uRatio > 1) { /*explain why you do this so it won't be changed by somebody else*/
iNormVal = iVal / uRatio;
}
I'm sure there are other approaches to consider.
Regards...

The if clause will actually most likely make the program slower. Branching is really bad for performance because modern processors are pipelined, and branches prevent the pipeline from being fully effective. This is such a significant issue that considerable effort goes into branch prediction, but that's not going to help in this case. Even if the prediction is right 90% of the time, that means an empty pipeline 10% of the time, which is a lot worse than an int division (expecially when taking into account that the if clause itself takes time).
But most likely it does not matter at all because your code spends most of its time in a completely different place, making this whole question a huge waste of time.

Most performance issues are either ideological (you designed way wrong), or implementation of a proven slower algorithm (given choices).
Beyond that, performance gains are going to be at the assembly level, and will be platform dependent.
I can hardly recommend this as an actual concern for performance, unless you're really strapped for performance, at which point you need to go check the above first.
All you've done is raise the eyebrows of the person that will maintain your code. Hope you have a lazy programmer that leaves stuff alone, or you'll end up losing this code anyway.
Because you have no guarantees at this level, of how code will perform on different platforms given different compiler, compiler options, and optimizations, you may even lose the code to compiler optimization. It's best to focus on larger issues.

Which of the following would be more efficient?

In C:
Lets say function "Myfuny()" has 50 line of codes in which other smaller functions also get called. Which one of the following code would be more efficient?
void myfunction(long *a, long *b);
int i;
for(i=0;i<8;i++)
myfunction(&a, &b);
or
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
myfunction(&a, &b);
any help would be appreciated.

That's premature optimization, you just shouldn't care...
Now, from a code maintenance point of view the first form (with the loop) is definitely better.
From a run-time point of view and if the function is inline and defined in the same compilation unit, and with a compiler that does not unroll the loop itself, and if code is already in instruction cache (I don't know for moon phases, I still believe it shouldn't have any noticable effect) the second one may be marginally fastest.
As you can see, there is many conditions for it to be fastest, so you shouldn't do that. There is probably many other parameters to optimize in your program that would have a much greater effect for code speed than this one. Any change that would affect algorithmic complexity of the program will have a much greater effect. More generally speaking any code change that does not affect algorithmic complexity is probably premature optimization.
If you really want to be sure, measure. On x86 you can use the kind of trick I used in this question to get a fairly accurate measure. The trick is to read a processor register that count the number of cycles spent. The question also illustrate how code optimization questions can become tricky, even for very simple problems.

I'd assume the compiler will translate the first variant into the second.

The first. Any have half-decent compiler will optimize that for you. It's easier to read/understand and easier to write.
Secondly, write first, optimize second. Even if your compiler was completely brain dead and retarded, it at best would only save you a few nano/ms seconds on a modern CPU. Chances are there are bigger bottlenecks in your applications that could/should be optimized first.

It depends on so many things your best bet is to do it both ways and measure.

It would take less (of your) time to write out the for loop. I'd also say it's clearer to read with the loop. It would probably save a few instructions to write them out, but with modern processors and compilers it may amount to exactly the same result...

The first. It is easier to read.

First, are you sure you have a code execution performance problem? If you don't, then you're talking about making your code less readable and writable for no reason at all.
Second, have you profiled your program to see if this is in a place where it will take a significant amount of time? Humans are very bad at guessing the hot spots in programs, and without profiling you're likely to spend time and effort fiddling with things that don't make a difference.
Third, are you going to check the assembler code produced to see if there's a difference? If you're using an optimizing compiler with optimizations on, it's likely to produce what it sees fit for either. If you aren't, and you have a performance problem, get a better computer or turn on more optimizations.
Fourth, if there is a difference, are you going to test both ways to see which is better? On at least a representative sample of the systems your users will be running on?
And, to give you my best answer to which is more efficient: it depends. If they're in fact compiled to different code, the unrolled version might be faster because it doesn't have the loop overhead (which includes a conditional branch), and the rolled-up version might be faster because it's shorter code and will work better in the instruction cache. The usual wisdom was to unroll, but I once sped up a long-running section by rolling the execution up as tightly as I could.

On modern processors the size of compiled code becomes very importand. If this loop could run entirly from processor's cache it would be the fastest solution. As n8wrl said test yourself.

I created a short test for this, with surprising results. At least for me, anyway, I would've thought it was the other way round.
So, I wrote two versions of a program iterating over a function nothing(), that did nothing interesting (inc on a variable).
The first used proper loops (a million iterations of 1000 iterations, two nested fors), the second one did a million iterations of 1000 consecutive calls to nothing().
I used the time command to measure. The version with the proper loop took about 3.5 seconds on average, and the consecutive calling version took about 2.5 seconds on average.
I then tried to compile with optimization flags, but gcc detected that the program did essentially nothing and execution was instantaneous on both versions =P. Didn't bother fixing that.
Edit: if you were actually thinking of writing 8 consecutive calls in your code, please don't. Remember the famous quote: "Programs must be written for people to read, and only incidentally for machines to execute.".
Also note that my tests did nothing except nothing() (=P) and are no proper benchmarks to consider in any actual program.

Loop unrolling can make execution faster (otherwise Duff's Device wouldn't have been invented), but that's a function of so many variables (processor, cache size, compiler settings, what myfunction is actually doing, etc.) that you can't rely on it to always be true, or for whatever improvement to be worth the cost in readability and maintainability. The only way to know for sure if it makes a difference for your particular platform is to code up both versions and profile them.
Depending on what myfunction actually does, the difference could be so far down in the noise as to be undetectable.
This kind of micro-optimization should only be done if all of the following are true:
You're failing to meet a hard performance requirement;
You've already picked the proper algorithm and data structure for the problem at hand (e.g., in the average case a poorly optimized Quicksort will beat the pants off of a highly optimized bubble sort, and in the worst case they'll be equally bad);
You're compiling with the highest level of optimization that the compiler offers;

How much does myFunction(long *a, long *b) do?
If it does much more than *a = *b + 1; the cost of calling the function can be so small compared to what goes on inside the function that you are really focussing in the wrong place.
On the other hand, in the overall picture of your application program, what percent of time is spent in these 8 calls? If it's not very much, then it won't make much difference no matter how tightly you optimize it.
As others say, profile, but that's not necessarily as simple as it sounds. Here's the method I and some others use.

Micro-optimizations in C, which ones are there? Is there anyone really useful? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I understand most of the micro-optimizations out there but are they really useful?
Exempli gratia: does doing ++i instead of i++, or while(1) or for(;;) really result in performance improvements (either in memory fingerprint or CPU cycles)?
So the question is, what micro-optimizations can be done in C? Are they really useful?

You should rely on your compiler to optimise this stuff. Concentrate on using appropriate algorithms and writing reliable, readable and maintainable code.

The day tclhttpd, a webserver written in Tcl, one of the slowest scripting language, managed to outperform Apache, a webserver written in C, one of the supposedly fastest compiled language, was the day I was convinced that micro-optimizations significantly pales in comparison to using a faster algorithm/technique*.
Never worry about micro-optimizations until you can prove in a debugger that it is the problem. Even then, I would recommend first coming here to SO and ask if it is a good idea hoping someone would convince you not to do it.
It is counter-intuitive but very often code, especially tight nested loops or recursion, are optimized by adding code rather than removing them. The gaming industry has come up with countless tricks to speed up nested loops using filters to avoid unnecessary processing. Those filters add significantly more instructions than the difference between i++ and ++i.
*note: We have learned a lot since then. The realization that a slow scripting language can outperform compiled machine code because spawning threads is expensive led to the developments of lighttpd, NginX and Apache2.

There's a difference, I think, between a micro-optimization, a trick, and alternative means of doing something. It can be a micro-optimization to use ++i instead of i++, though I would think of it as merely avoiding a pessimization, because when you pre-increment (or decrement) the compiler need not insert code to keep track of the current value of the variable for use in the expression. If using pre-increment/decrement doesn't change the semantics of the expression, then you should use it and avoid the overhead.
A trick, on the other hand, is code that uses a non-obvious mechanism to achieve a result faster than a straight-forward mechanism would. Tricks should be avoided unless absolutely needed. Gaining a small percentage of speed-up is generally not worth the damage to code readability unless that small percentage reflects a meaningful amount of time. Extremely long-running programs, especially calculation-heavy ones, or real-time programs are often candidates for tricks because the amount of time saved may be necessary to meet the systems performance goals. Tricks should be clearly documented if used.
Alternatives, are just that. There may be no performance gain or little; they just represent two different ways of expressing the same intent. The compiler may even produce the same code. In this case, choose the most readable expression. I would say to do so even if it results in some performance loss (though see the preceding paragraph).

I think you do not need to think about these micro-optimizations because most of them is done by compiler. These things can only make code more difficult to read.
Remember, [edited] premature [/edited] optimization is an evil.

To be honest, that question, while valid, is not relevant today - why?
Compiler writers are a lot more smarter than they were 20 years ago, rewind back in time, then these optimizations would have been very relevant, we were all working with old 80286/386 processors, and coders would often resort to tricks to squeeze even more bytes out of the compiled code.
Today, processors are too fast, compiler writers knows the intimate details of operand instructions to make every thing work, considering that there is pipe-lining, core processors, acres of RAM, remember, with a 80386 processor, there would be 4Mb RAM and if you're lucky, 8Mb was considered superior!!
The paradigm has shifted, it was about squeezing every byte out of compiled code, now it is more on programmer productivity and getting the release out the door much sooner.
The above I have stated the nature of the processor, and compilers, I was talking about the Intel 80x86 processor family, Borland/Microsoft compilers.
Hope this helps,
Best regards,
Tom.

If you can easily see that two different code sequences produce identical results, without making assumptions about the data other than what's present in the code, then the compiler can too, and generally will.
It's only when the transformation from one to the other is highly non-obvious or requires assuming something that you may know to be true but the compiler has no way to infer (eg. that an operation cannot overflow or that two pointers will never alias, even though they aren't declared with the restrict keyword) that you should spend time thinking about these things. Even then, the best thing to do is usually to find a way to inform the compiler about the assumptions that it can make.
If you do find specific cases where the compiler misses simple transformations, 99% of the time you should just file a bug against the compiler and get on with working on more important things.

Keeping the fact that memory is the new disk in mind will likely improve your performance far more than applying any of those micro-optimizations.

For a slightly more pragmatic take on the question of ++i vs. i++ (at least in a C++ context) see http://llvm.org/docs/CodingStandards.html#micro_preincrement.
If Chris Lattner says it, I've got to pay attention. ;-)

You would do better to consider every program you write primarily as a language in which you communicate your ideas, intentions and reasoning to other human beings who will have to bug-fix, reuse and understand it. They will spend more time on decoding garbled code than any compiler or runtime system will do executing it.
To summarise, say what you mean in the clearest way, using the common idioms of the language in question.
For these specific examples in C, for(;;) is the idiom for an infinite loop and "i++" is the usual idiom for "add one to i" unless you use the value in an expression, in which case it depends whether the value with the clearest meaning is the one before or after the increment.

Here's real optimization, in my experience.
Someone on SO once remarked that micro-optimization was like "getting a haircut to lose weight". On American TV there is a show called "The Biggest Loser" where obese people compete to lose weight. If they were able to get their body weight down to a few grams, then getting a haircut would help.
Maybe that's overstating the analogy to micro-optimization, because I have seen (and written) code where micro-optimization actually did make a difference, but when starting off there is a lot more to be gained by simply not solving problems you don't have.

x ^= y
y ^= x
x ^= y

++i should be prefered over i++ for situations where you don't use the return value because it better represents the semantics of what you are trying to do (increment i) rather than any possible optimisation (it might be slightly faster, and is probably not worse).

Generally, loops that count towards zero are faster than loops that count towards some other number. I can imagine a situation where the compiler can't make this optimization for you, but you can make it yourself.
Say that you have and array of length x, where x is some very big number, and that you need to perform some operation on each element of x. Further, let's say that you don't care what order these operations occur in. You might do this...
int i;
for (i = 0; i < x; i++)
doStuff(array[i]);
But, you could get a little optimization by doing it this way instead -
int i;
for (i = x-1; i != 0; i--)
{
doStuff(array[i]);
}
doStuff(array[0]);
The compiler doesn't do it for you because it can't assume that order is unimportant.
MaR's example code is better. Consider this, assuming doStuff() returns an int:
int i = x;
while (i != 0)
{
--i;
printf("%d\n",doStuff(array[i]));
}
This is ok as long as printing the array contents in reverse order is acceptable, but the compiler can't decide that for you.
This being an optimization is hardware dependent. From what I remember about writing assembler (many, many years ago), counting up rather than counting down to zero requires an extra machine instruction each time you go through the loop.
If your test is something like (x < y), then evaluation of the test goes something like this:
subtract y from x, storing the result in some register r1
test r1, to set the n and z flags
branch based on the values of the n and z flags
If your test is ( x != 0), you can do this:
test x, to set the z flag
branch based on the value of the z flag
You get to skip a subtract instruction for each iteration.
There are architectures where you can have the subtract instruction set the flags based on the result of the subtraction, but I'm pretty sure x86 isn't one of them, so most of us aren't using compilers that have access to such a machine instruction.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight