Faster way to copy C array with calculation between - c

I want to copy data from one C array to another, but with a calculation in between (i.e. not just copying the same content from one to the other, but modifying the data):
int aaa;
int src[ARRAY_SIZE];
int dest[ARRAY_SIZE];
//fill src with data
for (aaa = 0; aaa < ARRAY_SIZE; aaa++)
{
dest[aaa] = src[aaa] * 30;
}
This is done on buffers of size 520 or larger, so the cost of the for loop is considerable.
Is there any way to improve performance here as far as the coding is concerned?
I did some research on the topic, but I couldn't find anything specific about this case, only about simple buffer-to-buffer copies (examples: here, here and here).
Environment: GCC for ARM using Embedded Linux. The specific code above, though, is used inside a C project running inside a dedicated processor for DSP calculations. The general processor is an OMAP L138 (the DSP processor is included in the L138).

You could try techniques such as loop unrolling or Duff's device, but if you switch on compiler optimisation it will probably do that for you where it is advantageous, without you having to make your code unreadable.
The advantage of relying on compiler optimisation is that it is architecture specific; a source-level technique that works on one target may not work so well on another, but compiler generated optimisations will be specific to the target. For example there is no way to code specifically for SIMD instructions in C, but the compiler may generate code to take advantage of them, and to do that, it is best to keep the code simple and direct so that the compiler can spot the idiom. Writing weird code to "hand optimise" can defeat the optimizer and stop it doing its job.
Another possibility that may be advantageous on some targets (if you are only ever coding for desktop x86 targets, this may be irrelevant), is to avoid the multiply instruction by using shifts:
Given that x * 30 is equivalent to x * 32 - x * 2, the expression in the loop can be replaced with:
dest[aaa] = (src[aaa] << 5) - (src[aaa] << 1);
But again the optimizer may well do that for you; it will also avoid the repeated evaluation of src[aaa], but if that were not the case, the following may be beneficial:
int i = src[aaa];
dest[aaa] = (i << 5) - (i << 1);
The shifting technique is likely to be more advantageous for division operations which are far more expensive on most targets, and it is only applicable to constants.
These techniques are likely to improve the performance of unoptimized code, but compiler optimisations will likely do far better, and the original code may optimise better than "hand optimised" code.
In the end if it is important, you have to experiment and perform timing tests or profiling.
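If it matters on your DSP target, one more thing worth trying is to tell the compiler there is no aliasing; the following is only a minimal sketch, assuming the source and destination buffers never overlap and that your toolchain supports C99 restrict, which gives the optimiser more freedom to vectorise or unroll the simple loop:
#include <stddef.h>

/* Sketch only: restrict is a promise from the programmer that dest and src
   never alias, so the optimiser can reorder, unroll or vectorise freely. */
void scale_by_30(int * restrict dest, const int * restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        dest[i] = src[i] * 30;  /* leave the strength reduction to the compiler */
    }
}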

Related

Optimize c code by code motion

int optimizeMe(const char* string0, const char* string1)
{
int i0;
int i1 = strlen(string1) - 1;
int count = 0;
for (i0 = 0; i0 < strlen(string0); i0++)
{
if (toupper(string0[i0]) == toupper(string1[i1]))
count++;
count++;
if ((i0%32)==0)
i1--;
}
return(count / 8);
}
I know I can optimize this code by using register, gcc -O2, reduction in strength i0%32=0x10000, and common expression count/8 = count >> 3, etc.;
However, how can I optimize them by code motion? Specifically for the if statement and i1--.
Any hints are appreciated !
As Lundin pointed out in the comments, these are premature optimisations. They're premature because you're clearly just guessing, and not using a profiler to test your theories on real-world programs with real-world bottlenecks.
They're also micro-optimisations, and we haven't really needed to micro-optimise in C since the 80s, thanks to significant breakthroughs in technology that allow us to do amazing things like mocking up three-dimensional realms in real-time, for example.
gcc already has support for various features such as dead code elimination, code hoisting (even into compile-time in some cases) and profile-guided optimisation, which you might be interested in. I think we take for granted the fact that we have a compiler which can statically deduce when code is unreachable, all by its lonesome; that's quite a complex optimisation for a machine to perform.
By profiling as you test, and then recompiling, feeding the profiler data back into the compiler, the compiler obtains information about how to rearrange branches to be better predicted. That's profile-guided optimisation, and it's semi-automated. I wonder what the authors of Wolfenstein 3D would've done for this kind of technology...
Speaking of profilers, if I may suggest that you test these in some realistic use cases (i.e. actual programs that have active development and a large community):
using register
reduction in strength i%32=0x10000
count/8 = count >> 3
That last optimisation isn't even correct for you (see exercise 5), by the way... This might be a good time to mention the other debugging tools we have in our suite. I can highly recommend checking out ASan (and the other sanitizers, such as UBSan), for example; they will likely save you hours of debugging one day.
It might be best to use size_t for the loop counter, since that's what strlen returns for a start; size_t is more portable for use with strlen and quite possibly faster too (due to the fact that size_t has no sign bit and so no potential sign-handling overhead when you write things like for (size_t x = 0; x < y; x++))...
... or, if you want to provide architecture-specific hints to your compiler (which presumably has no profile-guided optimisation, or else you wouldn't need to do that manually), you could also use uint_fast32_t or something else that isn't really suitable for the task, but is still vastly more suitable than int.
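A rough sketch of what that loop header might look like, with the strlen call also hoisted out of the condition (an assumption on my part that string0 isn't modified inside the loop; otherwise the hoist is invalid):
/* Sketch: size_t matches strlen's return type, and the length is computed
   once rather than on every iteration. */
size_t len0 = strlen(string0);
for (size_t i0 = 0; i0 < len0; i0++)
{
/* ... loop body as before ... */
}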
I gather you must be getting input from somewhere, or else your program is "pure", in the (functional) sense that it has no side-effects that change the way the user interacts with the system at all (in that case, your compiler might even hoist all of your logic into compile-time evaluation)... Have you considered adjusting the buffer sizes of whichever files and/or streams (e.g. stdin and stdout) you're using? You ought to be able to use setvbuf to do that... If you have many streams open at once, you might want to try choosing a smaller stream buffer size in order to keep all of your stuff in cache. I like to start off with a buffer size of 1, and work my way up from there; that way you'll see precisely where the bottleneck for your system is... To be clear, if you were to unroll loops (which gcc will happily do automatically if it's beneficial)...
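For reference, a minimal setvbuf sketch (the buffer size here is an arbitrary illustration, not a recommendation) looks something like this:
#include <stdio.h>

int main(void)
{
    static char in_buf[4096];  /* arbitrary size, purely for illustration */
    /* setvbuf must be called before any other operation on the stream. */
    setvbuf(stdin, in_buf, _IOFBF, sizeof in_buf);
    /* ... read from stdin as usual ... */
    return 0;
}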
If you're using a really primitive compiler (which you're not, though profilers are honestly likely to guide you straight to the heftiest optimisations in any case, optimising your time as a developer), you might be able to suggest to the compiler to emit non-branching code for these lines:
// consider `count += (toupper((unsigned char) string0[i0]) == toupper((unsigned char) string1[i1]));`?
if (toupper(string0[i0]) == toupper(string1[i1]))
count++;
The casts are necessary to prevent crashes in certain circumstances, by the way... you absolutely need to make sure the only values you pass to a <ctype.h> function are unsigned char values, or EOF.
// consider using `i1 -= !(i0 % 32);`?
if ((i0%32)==0)
i1--;

Should I optimize my code myself or let the compiler/gcc to do it

I am writing some C code and I would like to know whether making a simple operation like multiplication more CPU-friendly makes any difference and makes the code faster. For example, replacing this line of code:
y = x * 15;
with
y = x << 4;
y -= x;
Does the compiler already do that? Should I use the -O2 option in order to make it happen?
There are two parts to the answer:
No, unless you are writing a very specialized function (e.g. a signal processing function that must execute in 20 clocks) you should not optimize; leave that to the compiler. In general your job is to write readable code, and the compiler will (to the best of its capability) optimize it. Note that the optimization will be different for different processors, as their hardware (computational capability) can be very different. For example a shift-by-N instruction (like that in your code) may take N clocks on a processor with a regular shifter, but it will take a single clock (or less) on processors with a hardware barrel shifter.
Yes, most modern optimizing compilers will optimize (for example replace multiplication by shifting where appropriate) without explicit optimization options.
Summarized, optimize only in rare situations when you already know that the compiler didn't do a good job, it is a problem that must be addressed, you know well how to do better than the compiler, and the resulting increased maintenance cost is worth it.
Optimizing code by hand is almost always an exercise in futility these days, especially in higher level languages. While C is almost assembly, modern compilers have a lot more tricks built into them than most people are aware of.
In addition, unless the code you're optimizing is going to be used a lot, i.e., millions of times in close succession, the work of optimizing the code will cost more time than the savings you achieve.
With that said, the only way to see if your code is measurably faster would be to test it: Put each version in a tight loop and execute it a million (or more) times, and see if there's a noticeable difference.
Note that your optimization is for a specific multiplier - any other operand you use it for is going to yield different results. Because it can't be generalized, there's little likelihood this optimization will be done by any compiler in all cases - and just looking at the code and not knowing what processor architecture it will be run on, I can't say whether it would be faster or not.
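A crude sketch of such a test harness (names and iteration counts invented; volatile is used only to stop the compiler from deleting the work) might look like this:
#include <stdio.h>
#include <time.h>

int main(void)
{
    volatile long y = 0;
    const long iterations = 100000000L;

    clock_t t0 = clock();
    for (long x = 0; x < iterations; x++)
        y = x * 15;                /* variant A: plain multiply */
    clock_t t1 = clock();
    for (long x = 0; x < iterations; x++)
        y = (x << 4) - x;          /* variant B: shift and subtract */
    clock_t t2 = clock();

    printf("multiply: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("shift:    %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    (void)y;
    return 0;
}
With optimisation enabled you should expect both loops to come out much the same, which is rather the point.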

Is there "compiler-friendly" code / convention [duplicate]

Many years ago, C compilers were not particularly smart. As a workaround K&R invented the register keyword, to hint to the compiler that maybe it would be a good idea to keep this variable in an internal register. They also made the ternary operator to help generate better code.
As time passed, the compilers matured. They became very smart in that their flow analysis allowed them to make better decisions about what values to hold in registers than you could possibly do. The register keyword became unimportant.
FORTRAN can be faster than C for some sorts of operations, due to alias issues. In theory with careful coding, one can get around this restriction to enable the optimizer to generate faster code.
What coding practices are available that may enable the compiler/optimizer to generate faster code?
Identifying the platform and compiler you use, would be appreciated.
Why does the technique seem to work?
Sample code is encouraged.
Here is a related question
[Edit] This question is not about the overall process to profile, and optimize. Assume that the program has been written correctly, compiled with full optimization, tested and put into production. There may be constructs in your code that prohibit the optimizer from doing the best job that it can. What can you do to refactor that will remove these prohibitions, and allow the optimizer to generate even faster code?
[Edit] Offset related link
Here's a coding practice to help the compiler create fast code—any language, any platform, any compiler, any problem:
Do not use any clever tricks which force, or even encourage, the compiler to lay variables out in memory (including cache and registers) as you think best. First write a program which is correct and maintainable.
Next, profile your code.
Then, and only then, you might want to start investigating the effects of telling the compiler how to use memory. Make 1 change at a time and measure its impact.
Expect to be disappointed and to have to work very hard indeed for small performance improvements. Modern compilers for mature languages such as Fortran and C are very, very good. If you read an account of a 'trick' to get better performance out of code, bear in mind that the compiler writers have also read about it and, if it is worth doing, probably implemented it. They probably wrote what you read in the first place.
Write to local variables and not output arguments! This can be a huge help for getting around aliasing slowdowns. For example, if your code looks like
void DoSomething(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
for (int i=0; i<numFoo; i++)
{
barOut.munge(foo1, foo2[i]);
}
}
the compiler doesn't know that foo1 != barOut, and thus has to reload foo1 each time through the loop. It also can't read foo2[i] until the write to barOut is finished. You could start messing around with restricted pointers, but it's just as effective (and much clearer) to do this:
void DoSomethingFaster(const Foo& foo1, const Foo* foo2, int numFoo, Foo& barOut)
{
Foo barTemp = barOut;
for (int i=0; i<numFoo; i++)
{
barTemp.munge(foo1, foo2[i]);
}
barOut = barTemp;
}
It sounds silly, but the compiler can be much smarter dealing with the local variable, since it can't possibly overlap in memory with any of the arguments. This can help you avoid the dreaded load-hit-store (mentioned by Francis Boivin in this thread).
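In plain C the equivalent idea is usually expressed with restrict plus a local accumulator; here is a hedged sketch with types and names invented for illustration:
/* Sketch: restrict tells the compiler dst and src never alias, and the
   local accumulator keeps the running value in a register across the loop. */
void accumulate(float * restrict dst, const float * restrict src, int n)
{
    float acc = *dst;       /* work in a local, then write back once */
    for (int i = 0; i < n; i++)
        acc += src[i];
    *dst = acc;
}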
The order you traverse memory can have profound impacts on performance, and compilers aren't really good at figuring that out and fixing it. You have to be mindful of cache locality concerns when you write code if you care about performance. For example, two-dimensional arrays in C are allocated in row-major format. Traversing them in column-major order will tend to cause more cache misses and make your program more memory bound than processor bound:
#define N 1000000
int matrix[N][N] = { ... };
//awesomely fast
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[i][j];
}
}
//painfully slow
long sum = 0;
for(int i = 0; i < N; i++){
for(int j = 0; j < N; j++){
sum += matrix[j][i];
}
}
Generic Optimizations
Here are some of my favorite optimizations. I have actually reduced execution times and program sizes by using these.
Declare small functions as inline or macros
Each call to a function (or method) incurs overhead, such as pushing variables onto the stack. Some functions may incur an overhead on return as well. An inefficient function or method is one whose body contains fewer statements than the combined call and return overhead. Such functions are good candidates for inlining, whether it be as #define macros or inline functions. (Yes, I know inline is only a suggestion, but in this case I consider it as a reminder to the compiler.)
Remove dead and redundant code
If the code isn't used or does not contribute to the program's result, get rid of it.
Simplify design of algorithms
I once removed a lot of assembly code and execution time from a program by writing down the algebraic equation it was calculating and then simplified the algebraic expression. The implementation of the simplified algebraic expression took up less room and time than the original function.
Loop Unrolling
Each loop has an overhead of incrementing and termination checking. To get an estimate of the performance factor, count the number of instructions in the overhead (minimum 3: increment, check, goto start of loop) and divide by the number of statements inside the loop. The lower the number the better.
Edit: provide an example of loop unrolling
Before:
unsigned int sum = 0;
for (size_t i = 0; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
After unrolling:
unsigned int sum = 0;
const size_t STATEMENTS_PER_LOOP = 8;
size_t i = 0;
for (i = 0; i + STATEMENTS_PER_LOOP <= BYTES_TO_CHECKSUM; i += STATEMENTS_PER_LOOP)
{
sum += *buffer++; // 1
sum += *buffer++; // 2
sum += *buffer++; // 3
sum += *buffer++; // 4
sum += *buffer++; // 5
sum += *buffer++; // 6
sum += *buffer++; // 7
sum += *buffer++; // 8
}
// Handle the remainder:
for (; i < BYTES_TO_CHECKSUM; ++i)
{
sum += *buffer++;
}
A secondary benefit is gained as well: more statements are executed before the processor has to reload the instruction cache.
I've had amazing results when I unrolled a loop to 32 statements. This was one of the bottlenecks since the program had to calculate a checksum on a 2GB file. This optimization combined with block reading improved performance from 1 hour to 5 minutes. Loop unrolling provided excellent performance in assembly language too, my memcpy was a lot faster than the compiler's memcpy. -- T.M.
Reduction of if statements
Processors hate branches, or jumps, since they force the processor to reload its queue of instructions.
Boolean Arithmetic (Edited: applied code format to code fragment, added example)
Convert if statements into boolean assignments. Some processors can conditionally execute instructions without branching:
bool status = true;
status = status && /* first test */;
status = status && /* second test */;
The short circuiting of the Logical AND operator (&&) prevents execution of the tests if the status is false.
Example:
struct Reader_Interface
{
virtual bool write(unsigned int value) = 0;
};
struct Rectangle
{
unsigned int origin_x;
unsigned int origin_y;
unsigned int height;
unsigned int width;
bool write(Reader_Interface * p_reader)
{
bool status = false;
if (p_reader)
{
status = p_reader->write(origin_x);
status = status && p_reader->write(origin_y);
status = status && p_reader->write(height);
status = status && p_reader->write(width);
}
return status;
}
};
Factor Variable Allocation outside of loops
If a variable is created on the fly inside a loop, move the creation / allocation to before the loop. In most instances, the variable doesn't need to be allocated during each iteration.
Factor constant expressions outside of loops
If a calculation or variable value does not depend on the loop index, move it outside (before) the loop.
I/O in blocks
Read and write data in large chunks (blocks). The bigger the better. For example, reading one octet at a time is less efficient than reading 1024 octets with one read.
Example:
static const char Menu_Text[] = "\n"
"1) Print\n"
"2) Insert new customer\n"
"3) Destroy\n"
"4) Launch Nasal Demons\n"
"Enter selection: ";
static const size_t Menu_Text_Length = sizeof(Menu_Text) - sizeof('\0');
//...
std::cout.write(Menu_Text, Menu_Text_Length);
The efficiency of this technique can be visually demonstrated. :-)
Don't use printf family for constant data
Constant data can be output using a block write. Formatted write will waste time scanning the text for formatting characters or processing formatting commands. See above code example.
Format to memory, then write
Format to a char array using multiple sprintf, then use fwrite. This also allows the data layout to be broken up into "constant sections" and variable sections. Think of mail-merge.
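A hedged sketch of that idea (the record layout and field names are invented), using snprintf here rather than plain sprintf just to keep the example bounded:
#include <stdio.h>

void write_record(FILE *out, const char *name, int qty, double price)
{
    char buf[256];
    /* Build the whole formatted record in memory... */
    int len = snprintf(buf, sizeof buf, "%s,%d,%.2f\n", name, qty, price);
    /* ...then emit it with a single block write. */
    if (len > 0 && (size_t)len < sizeof buf)
        fwrite(buf, 1, (size_t)len, out);
}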
Declare constant text (string literals) as static const
When variables are declared without the static, some compilers may allocate space on the stack and copy the data from ROM. These are two unnecessary operations. This can be fixed by using the static prefix.
Lastly, Code like the compiler would
Sometimes, the compiler can optimize several small statements better than one complicated version. Also, writing code to help the compiler optimize helps too. If I want the compiler to use special block transfer instructions, I will write code that looks like it should use the special instructions.
The optimizer isn't really in control of the performance of your program, you are. Use appropriate algorithms and structures and profile, profile, profile.
That said, you shouldn't call a small function defined in another file from within an inner loop, as that stops it from being inlined.
Avoid taking the address of a variable if possible. Asking for a pointer isn't "free" as it means the variable needs to be kept in memory. Even an array can be kept in registers if you avoid pointers — this is essential for vectorizing.
Which leads to the next point, read the ^#$# manual! GCC can vectorize plain C code if you sprinkle a __restrict__ here and an __attribute__( __aligned__ ) there. If you want something very specific from the optimizer, you might have to be specific.
On most modern processors, the biggest bottleneck is memory.
Aliasing: Load-Hit-Store can be devastating in a tight loop. If you're reading one memory location and writing to another and know that they are disjoint, carefully putting the restrict keyword on the function parameters can really help the compiler generate faster code. However if the memory regions do overlap and you used restrict, you're in for a good debugging session of undefined behaviour!
Cache-miss: Not really sure how you can help the compiler since it's mostly algorithmic, but there are intrinsics to prefetch memory.
Also don't try to convert floating point values to int and vice versa too much since they use different registers and converting from one type to another means calling the actual conversion instruction, writing the value to memory and reading it back in the proper register set.
The vast majority of code that people write will be I/O bound (I believe all the code I have written for money in the last 30 years has been so bound), so the activities of the optimiser for most folks will be academic.
However, I would remind people that for the code to be optimised you have to tell the compiler to optimise it - lots of people (including me when I forget) post C++ benchmarks here that are meaningless without the optimiser being enabled.
use const correctness as much as possible in your code. It allows the compiler to optimize much better.
In this document are loads of other optimization tips: CPP optimizations (a bit old document though)
highlights:
use constructor initialization lists
use prefix operators
use explicit constructors
inline functions
avoid temporary objects
be aware of the cost of virtual functions
return objects via reference parameters
consider per class allocation
consider stl container allocators
the 'empty member' optimization
etc
Attempt to program using static single assignment as much as possible. SSA is exactly the same as what you end up with in most functional programming languages, and that's what most compilers convert your code to in order to do their optimizations, because it's easier to work with. By doing this, places where the compiler might get confused are brought to light. It also makes all but the worst register allocators work as well as the best register allocators, and allows you to debug more easily because you almost never have to wonder where a variable got its value from, as there was only one place it was assigned.
Avoid global variables.
When working with data by reference or pointer, pull it into local variables, do your work, and then copy it back (unless you have a good reason not to).
Make use of the almost free comparison against 0 that most processors give you when doing math or logic operations. You almost always get a flag for ==0 and <0, from which you can easily get 3 conditions:
x= f();
if(!x){
a();
} else if (x<0){
b();
} else {
c();
}
is almost always cheaper than testing for other constants.
Another trick is to use subtraction to eliminate one compare in range testing.
#define FOO_MIN 8
#define FOO_MAX 199
int good_foo(int foo) {
unsigned int bar = foo-FOO_MIN;
int rc = ((FOO_MAX-FOO_MIN) < bar) ? 1 : 0;
return rc;
}
This can very often avoid a jump in languages that do short circuiting on boolean expressions and avoids the compiler having to try to figure out how to handle keeping
up with the result of the first comparison while doing the second and then combining them.
This may look like it has the potential to use up an extra register, but it almost never does. Often you don't need foo anymore anyway, and if you do rc isn't used yet so it can go there.
When using the string functions in C (strcpy, memcpy, ...) remember what they return -- the destination! You can often get better code by 'forgetting' your copy of the pointer to the destination and just grabbing it back from the return of these functions.
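A tiny sketch of that, reusing the returned destination pointer directly instead of keeping a separate copy:
#include <stdio.h>
#include <string.h>

int main(void)
{
    char joined[64];
    /* strcpy and strcat both return their destination, so the result can be
       chained straight into the next call with no extra pointer variable. */
    puts(strcat(strcpy(joined, "Hello, "), "world"));
    return 0;
}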
Never overlook the opportunity to return exactly the same thing the last function you called returned. Compilers are not so great at picking up that:
foo_t * make_foo(int a, int b, int c) {
foo_t * x = malloc(sizeof(foo_t));
if (!x) {
// return NULL;
return x; // x is NULL, already in the register used for returns, so duh
}
x->a= a;
x->b = b;
x->c = c;
return x;
}
Of course, you could reverse the logic on that if and only have one return point.
(tricks I recalled later)
Declaring functions as static when you can is always a good idea. If the compiler can prove to itself that it has accounted for every caller of a particular function then it can break the calling conventions for that function in the name of optimization. Compilers can often avoid moving parameters into registers or stack positions that called functions usually expect their parameters to be in (it has to deviate in both the called function and the location of all callers to do this). The compiler can also often take advantage of knowing what memory and registers the called function will need and avoid generating code to preserve variable values that are in registers or memory locations that the called function doesn't disturb. This works particularly well when there are few calls to a function. This gets much of the benefit of inlining code, but without actually inlining.
I wrote an optimizing C compiler and here are some very useful things to consider:
Make most functions static. This allows interprocedural constant propagation and alias analysis to do its job, otherwise the compiler needs to presume that the function can be called from outside the translation unit with completely unknown values for the parameters. If you look at the well-known open-source libraries they all mark functions static except the ones that really need to be extern.
If global variables are used, mark them static and constant if possible. If they are initialized once (read-only), it's better to use an initializer list like static const int VAL[] = {1,2,3,4}, otherwise the compiler might not discover that the variables are actually initialized constants and will fail to replace loads from the variable with the constants.
NEVER use a goto to the inside of a loop; the loop will not be recognized anymore by most compilers, and none of the most important optimizations will be applied.
Use pointer parameters only if necessary, and mark them restrict if possible. This helps alias analysis a lot because the programmer guarantees there is no alias (the interprocedural alias analysis is usually very primitive). Very small struct objects should be passed by value, not by reference.
Use arrays instead of pointers whenever possible, especially inside loops (a[i]). An array usually offers more information for alias analysis and after some optimizations the same code will be generated anyway (search for loop strength reduction if curious). This also increases the chance for loop-invariant code motion to be applied.
Try to hoist outside the loop calls to large functions or external functions that don't have side-effects (don't depend on the current loop iteration). Small functions are in many cases inlined or converted to intrinsics that are easy to hoist, but large functions might seem for the compiler to have side-effects when they actually don't. Side-effects for external functions are completely unknown, with the exception of some functions from the standard library which are sometimes modeled by some compilers, making loop-invariant code motion possible.
When writing tests with multiple conditions place the most likely one first. if(a || b || c) should be if(b || a || c) if b is more likely to be true than the others. Compilers usually don't know anything about the possible values of the conditions and which branches are taken more (they could be known by using profile information, but few programmers use it).
Using a switch can be faster than a test like if (x == a || x == b || ... || x == z). Check first whether your compiler does this automatically; some do, and it's more readable to have the if, though.
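A small sketch of the two forms (the value set is invented; which one wins depends on what the compiler emits for the switch, e.g. a jump table or a bit test):
/* if-chain form: tested in order unless the compiler spots the pattern */
int is_vowel_if(char c)
{
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

/* switch form: the whole value set is visible to the compiler at once */
int is_vowel_switch(char c)
{
    switch (c) {
    case 'a': case 'e': case 'i': case 'o': case 'u':
        return 1;
    default:
        return 0;
    }
}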
In the case of embedded systems and code written in C/C++, I try and avoid dynamic memory allocation as much as possible. The main reason I do this is not necessarily performance but this rule of thumb does have performance implications.
Algorithms used to manage the heap are notoriously slow on some platforms (e.g., VxWorks). Even worse, the time that it takes to return from a call to malloc is highly dependent on the current state of the heap. Therefore, any function that calls malloc is going to take a performance hit that cannot be easily accounted for. That performance hit may be minimal if the heap is still clean, but after the device runs for a while the heap can become fragmented. The calls are going to take longer and you cannot easily calculate how performance will degrade over time. You cannot really produce a worst-case estimate. The optimizer cannot provide you with any help in this case either. To make matters even worse, if the heap becomes too heavily fragmented, the calls will start failing altogether. The solution is to use memory pools (e.g., glib slices) instead of the heap. The allocation calls are going to be much faster and deterministic if you do it right.
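This isn't the glib slices API, just a minimal fixed-block pool sketch (block size and count are invented) to show why such allocators are fast and deterministic: allocation and release are each a couple of pointer operations, regardless of how long the system has been running.
#include <stddef.h>

#define BLOCK_SIZE  64   /* illustrative values only */
#define BLOCK_COUNT 128

static union block {
    union block *next;
    char payload[BLOCK_SIZE];
} pool[BLOCK_COUNT], *free_list;

void pool_init(void)
{
    for (size_t i = 0; i < BLOCK_COUNT - 1; i++)
        pool[i].next = &pool[i + 1];
    pool[BLOCK_COUNT - 1].next = NULL;
    free_list = pool;
}

void *pool_alloc(void)          /* O(1): pop the head of the free list */
{
    union block *b = free_list;
    if (b)
        free_list = b->next;
    return b;
}

void pool_free(void *p)         /* O(1): push the block back on the list */
{
    union block *b = p;
    b->next = free_list;
    free_list = b;
}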
A dumb little tip, but one that will save you some microscopic amounts of speed and code.
Always pass function arguments in the same order.
If you have f_1(x, y, z) which calls f_2, declare f_2 as f_2(x, y, z). Do not declare it as f_2(x, z, y).
The reason for this is that C/C++ platform ABI (AKA calling convention) promises to pass arguments in particular registers and stack locations. When the arguments are already in the correct registers then it does not have to move them around.
While reading disassembled code I've seen some ridiculous register shuffling because people didn't follow this rule.
Two coding techniques I didn't see in the list above:
Bypass the linker by writing code as a single source
While separate compilation is really nice for compile time, it is very bad when you speak of optimization. Basically the compiler can't optimize beyond the compilation unit; that is the linker's reserved domain.
But if you design your program well you can also compile it through a single common source. That is, instead of compiling unit1.c and unit2.c and then linking both objects, compile all.c, which merely #includes unit1.c and unit2.c. Thus you will benefit from all the compiler optimizations.
It's very much like writing header-only programs in C++ (and even easier to do in C).
This technique is easy enough if you write your program to enable it from the beginning, but you must also be aware that it changes part of C's semantics and you can run into problems such as static variable or macro collisions. For most programs it's easy enough to overcome the small problems that occur. Also be aware that compiling as a single source is much slower and may take a huge amount of memory (usually not a problem with modern systems).
Using this simple technique I happened to make some programs I wrote ten times faster!
Like the register keyword, this trick could also become obsolete soon. Optimizing through the linker is beginning to be supported by compilers (gcc: Link Time Optimization).
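As a sketch of the single-source build described above (the unit names are placeholders), all.c is nothing more than:
/* all.c - include the implementation files directly so the whole program is
   one translation unit and the optimizer can see across "module" borders. */
#include "unit1.c"
#include "unit2.c"
and the build becomes something like gcc -O2 all.c instead of compiling and linking the units separately.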
Separate atomic tasks in loops
This one is more tricky. It's about the interaction between algorithm design and the way the optimizer manages cache and register allocation. Quite often programs have to loop over some data structure and for each item perform some actions. Quite often the actions performed can be split into two logically independent tasks. If that is the case, you can write exactly the same program with two loops on the same boundary, each performing exactly one task. In some cases writing it this way can be faster than the single loop (the details are more complex, but an explanation can be that in the simple-task case all variables can be kept in processor registers, while in the more complex one that isn't possible and some registers must be written to memory and read back later, at a cost higher than the additional flow control).
Be careful with this one (profile performance with and without this trick), as like using register it may just as well reduce performance as improve it.
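A sketch of the split (all names invented; as noted above, profile both versions, because either can win depending on the target):
/* Fused: two independent tasks share one loop and compete for registers. */
void process_fused(const int *src, int *scaled, long *sum_sq, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        scaled[i] = src[i] * 3;          /* task A */
        s += (long)src[i] * src[i];      /* task B */
    }
    *sum_sq = s;
}

/* Fissioned: each loop does exactly one task over the same range. */
void process_split(const int *src, int *scaled, long *sum_sq, int n)
{
    for (int i = 0; i < n; i++)
        scaled[i] = src[i] * 3;
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)src[i] * src[i];
    *sum_sq = s;
}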
I've actually seen this done in SQLite and they claim it results in performance boosts ~5%: Put all your code in one file or use the preprocessor to do the equivalent to this. This way the optimizer will have access to the entire program and can do more interprocedural optimizations.
Most modern compilers should do a good job speeding up tail recursion, because the function calls can be optimized out.
Example:
int fac2(int x, int cur) {
if (x == 1) return cur;
return fac2(x - 1, cur * x);
}
int fac(int x) {
return fac2(x, 1);
}
Of course this example doesn't have any bounds checking.
Late Edit
While I have no direct knowledge of the code, it seems clear that the requirements of using CTEs on SQL Server were specifically designed so that it can optimize via tail-end recursion.
Don't do the same work over and over again!
A common antipattern that I see goes along these lines:
void Function()
{
MySingleton::GetInstance()->GetAggregatedObject()->DoSomething();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingElse();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingCool();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingReallyNeat();
MySingleton::GetInstance()->GetAggregatedObject()->DoSomethingYetAgain();
}
The compiler actually has to call all of those functions all of the time. Assuming you, the programmer, know that the aggregated object isn't changing over the course of these calls, for the love of all that is holy...
void Function()
{
MySingleton* s = MySingleton::GetInstance();
AggregatedObject* ao = s->GetAggregatedObject();
ao->DoSomething();
ao->DoSomethingElse();
ao->DoSomethingCool();
ao->DoSomethingReallyNeat();
ao->DoSomethingYetAgain();
}
In the case of the singleton getter the calls may not be too costly, but it is certainly a cost (typically, "check to see if the object has been created; if it hasn't, create it, then return it"). The more complicated this chain of getters becomes, the more wasted time we'll have.
Use the most local scope possible for all variable declarations.
Use const whenever possible
Don't use register unless you plan to profile both with and without it
The first two of these, especially #1, help the optimizer analyze the code. They will especially help it to make good choices about what variables to keep in registers.
Blindly using the register keyword is as likely to help as hurt your optimization. It's just too hard to know what will matter until you look at the assembly output or profile.
There are other things that matter to getting good performance out of code; designing your data structures to maximize cache coherency for instance. But the question was about the optimizer.
Align your data to native/natural boundaries.
I was reminded of something that I encountered once, where the symptom was simply that we were running out of memory, but the result was substantially increased performance (as well as huge reductions in memory footprint).
The problem in this case was that the software we were using made tons of little allocations. Like, allocating four bytes here, six bytes there, etc. A lot of little objects, too, running in the 8-12 byte range. The problem wasn't so much that the program needed lots of little things, it's that it allocated lots of little things individually, which bloated each allocation out to (on this particular platform) 32 bytes.
Part of the solution was to put together an Alexandrescu-style small object pool, but extend it so I could allocate arrays of small objects as well as individual items. This helped immensely in performance as well since more items fit in the cache at any one time.
The other part of the solution was to replace the rampant use of manually-managed char* members with an SSO (small-string optimization) string. The minimum allocation being 32 bytes, I built a string class that had an embedded 28-character buffer behind a char*, so 95% of our strings didn't need to do an additional allocation (and then I manually replaced almost every appearance of char* in this library with this new class, that was fun or not). This helped a ton with memory fragmentation as well, which then increased the locality of reference for other pointed-to objects, and similarly there were performance gains.
A neat technique I learned from #MSalters comment on this answer allows compilers to do copy elision even when returning different objects according to some condition:
// before
BigObject a, b;
if(condition)
return a;
else
return b;
// after
BigObject a, b;
if(condition)
swap(a,b);
return a;
If you've got small functions you call repeatedly, I have in the past got large gains by putting them in headers as "static inline". Function calls on the ix86 are surprisingly expensive.
Reimplementing recursive functions in a non-recursive way using an explicit stack can also gain a lot, but then you really are in the realm of development time vs gain.
Here's my second piece of optimisation advice. As with my first piece of advice this is general purpose, not language or processor specific.
Read the compiler manual thoroughly and understand what it is telling you. Use the compiler to its utmost.
I agree with one or two of the other respondents who have identified selecting the right algorithm as critical to squeezing performance out of a program. Beyond that the rate of return (measured in code execution improvement) on the time you invest in using the compiler is far higher than the rate of return in tweaking the code.
Yes, compiler writers are not from a race of coding giants and compilers contain mistakes and what should, according to the manual and according to compiler theory, make things faster sometimes makes things slower. That's why you have to take one step at a time and measure before- and after-tweak performance.
And yes, ultimately, you might be faced with a combinatorial explosion of compiler flags so you need to have a script or two to run make with various compiler flags, queue the jobs on the large cluster and gather the run time statistics. If it's just you and Visual Studio on a PC you will run out of interest long before you have tried enough combinations of enough compiler flags.
Regards
Mark
When I first pick up a piece of code I can usually get a factor of 1.4 -- 2.0 times more performance (ie the new version of the code runs in 1/1.4 or 1/2 of the time of the old version) within a day or two by fiddling with compiler flags. Granted, that may be a comment on the lack of compiler savvy among the scientists who originate much of the code I work on, rather than a symptom of my excellence. Having set the compiler flags to max (and it's rarely just -O3) it can take months of hard work to get another factor of 1.05 or 1.1
When DEC came out with its alpha processors, there was a recommendation to keep the number of arguments to a function under 7, as the compiler would always try to put up to 6 arguments in registers automatically.
For performance, focus first on writing maintainable code - componentized, loosely coupled, etc. - so when you have to isolate a part either to rewrite, optimize or simply profile, you can do it without much effort.
The optimizer will only help your program's performance marginally.
You're getting good answers here, but they assume your program is pretty close to optimal to begin with, and you say
Assume that the program has been
written correctly, compiled with full
optimization, tested and put into
production.
In my experience, a program may be written correctly, but that does not mean it is near optimal. It takes extra work to get to that point.
If I can give an example, this answer shows how a perfectly reasonable-looking program was made over 40 times faster by macro-optimization. Big speedups can't be done in every program as first written, but in many (except for very small programs), it can, in my experience.
After that is done, micro-optimization (of the hot-spots) can give you a good payoff.
I use the Intel compiler, on both Windows and Linux.
When more or less done, I profile the code, then home in on the hotspots and try to change the code to allow the compiler to do a better job.
If the code is computational and contains a lot of loops, the vectorization report in the Intel compiler is very helpful - look for 'vec-report' in the help.
So the main idea: polish the performance-critical code. As for the rest, the priority is to be correct and maintainable - short functions, clear code that can be understood a year later.
One optimization I have used in C++ is creating a constructor that does nothing. One must manually call an init() in order to put the object into a working state.
This has benefit in the case where I need a large vector of these classes.
I call reserve() to allocate the space for the vector, but the constructor does not actually touch the page of memory the object is on. So I have spent some address space, but not actually consumed a lot of physical memory. I avoid the page faults and the associated construction costs.
As I generate objects to fill the vector, I set them using init(). This limits my total page faults, and avoids the need to resize() the vector while filling it.
One thing I've done is try to keep expensive actions to places where the user might expect the program to delay a bit. Overall performance is related to responsiveness, but isn't quite the same, and for many things responsiveness is the more important part of performance.
The last time I really had to do improvements in overall performance, I kept an eye out for suboptimal algorithms, and looked for places that were likely to have cache problems. I profiled and measured performance first, and again after each change. Then the company collapsed, but it was interesting and instructive work anyway.
I have long suspected, but never proved, that declaring arrays with a power of 2 as the number of elements enables the optimizer to do a strength reduction, replacing the multiply in element lookups with a shift by a number of bits.
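A small illustration of that suspicion (dimensions invented): with a power-of-two row length, the row offset in the index calculation is a shift rather than a general multiply, whether or not the optimizer would have strength-reduced it anyway.
int table[100][16];   /* 16 columns: a power of two */

int lookup(int row, int col)
{
    /* Element address = base + ((row << 4) + col) * sizeof(int);
       row * 16 reduces to row << 4. */
    return table[row][col];
}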
Put small and/or frequently called functions at the top of the source file. That makes it easier for the compiler to find opportunities for inlining.

What kinds of C optimization practices in x86 that were historically recommended to use are no longer effective?

Due to the advances in x86 C compilers (namely GCC and Clang), many coding practices that were believed to improve efficiency are no longer used since the compilers can do a better job optimizing the code than humans (e.g. bit shift vs. multiplication).
Which specific practices are these?
Of the optimizations which are commonly recommended, a couple which are basically never fruitful given modern compilers include:
Mathematical transformations
Modern compilers understand mathematics, and will perform transformations on mathematical expressions where appropriate.
Optimizations such as conversion of multiplication to addition, or constant multiplication or division to bit shifting, are already performed by modern compilers, even at low optimization levels. Examples of these optimizations include:
x * 2 -> x + x
x * 2 -> x << 1
Note that some specific cases may differ. For instance, x >> 1 is not the same as x / 2 for negative signed values (the shift rounds toward negative infinity, the division toward zero); it is not appropriate to substitute one for the other!
Additionally, many of these suggested optimizations aren't actually any faster than the code they replace.
Stupid code tricks
I'm not even sure what to call this, but tricks like XOR swapping (a ^= b; b ^= a; a ^= b;) are not optimizations at all. They're just party tricks — they are slower, and more fragile, than the obvious approach. Don't use them.
The register keyword
This keyword is ignored by many modern compilers, as its intended meaning (forcing a variable to be stored in a register) is not meaningful given current register allocation algorithms.
Code transformations
Compilers will automatically perform a wide variety of code transformations where appropriate. Several such transformations which are frequently recommended for manual application, but which are rarely useful when applied thus, include:
Loop unrolling. (This is often actually harmful when applied indiscriminately, as it bloats code size.)
Function inlining. (Tag a function as static, and it'll usually be inlined where appropriate when optimization is enabled.)
One such practice is to avoid multiplications by using arrays of array pointers instead of real 2D arrays.
Old practice:
int width = 1234, height = 5678;
int* buffer = malloc(width*height*sizeof(*buffer));
int** image = malloc(height*sizeof(*image));
for(int i = height; i--; ) image[i] = &buffer[i*width];
//Now do some heavy computations with image[y][x].
This used to be faster, because multiplications used to be very expensive (on the order of 30 CPU cycles), while memory accesses were virtually free (it was only in the 1990s that caches were added because memory couldn't keep up with full CPU speed).
But multiplications became fast, some CPUs being able to do them in one CPU cycle, while memory accesses did not keep pace at all. So, now this code is likely to be more performant:
int width = 1234, height = 5678;
int (*image)[width] = malloc(height*sizeof(*image));
//Now do some heavy computations with image[y][x],
//which will invoke pointer arithmetic to calculate the offset as (y*width + x)*sizeof(int).
Currently, there are still some CPUs around, where the second code is not faster, but the big multiplication penalty is not with us anymore.
Due to the plurality of platforms, you can at best optimize for a given platform (or CPU architecture/model) and compiler! If your code runs on many platforms, it's a waste of time. (I'm talking about micro-optimizations; it's always worth considering better algorithms.)
That said, optimizing for a given platform, a DSP say, makes sense if the need for it arises. Then the best first helper is IMHO the judicious use of the restrict keyword, if the compiler/optimizer supports it well. Avoid algorithms involving conditions and jumpy code (breaks, goto, if, while, ...); this favors streaming and avoids too many bad branch predictions. I would agree these hints are common sense now.
Generally speaking I would say: Any manipulation that modifies the code by making assumptions on how the compiler optimizes shall be IMHO avoided at all.
Rather than that, switch to assembly (a common practice for some really important algorithms on DSPs, where the compilers, while being really great, still miss the last few % of CPU/memory-cycle performance increase...).
One optimization that really shouldn't be used much more is #define (expanding on duskwuff's answer a bit).
The C preprocessor is a wonderful thing, and it can do some amazing code transformations, and it can make certain really complex code much simpler — but using #define just to cause a small operation to be inlined isn't usually appropriate anymore. Most modern compilers have a real inline keyword (or equivalent, like __inline__), and they're smart enough to inline most static functions anyway, which means that code like this:
#define sum(x, y) ((x) + (y))
is really better written as the equivalent function:
static int sum(int x, int y)
{
return x + y;
}
You avoid dangerous multiple-evaluation problems and side-effects, you get compiler type-checking and you end up with cleaner code, too. If it's worth inlining, the compiler will do it.
In general, save the preprocessor for the circumstances where it's needed: Emitting a lot of complex, variant code or partial code quickly. Using the preprocessor for inlining small functions and defining constants is mostly an antipattern now.

When is assembly faster than C? [closed]

One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.
This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.
Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.
Here is a real world example: Fixed point multiplies on old compilers.
These don't only come in handy on devices without floating point; they shine when it comes to precision, as they give you 32 bits of precision with a predictable error (float only has 23 bits, and it's harder to predict precision loss), i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).
Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see
Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:
// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
long long a_long = a; // cast to 64 bit.
long long product = a_long * b; // perform multiplication
return (int) (product >> 16); // shift by the fixed point bias
}
The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.
x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (also the x86 can do such shifts).
So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.
If you rewrite the same code in (inline) assembler you can gain a significant speed boost.
In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.
Using intrinsics you can rewrite the function in a way that the C compiler has a chance to understand what's going on. This allows the code to be inlined and register allocated, and common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.
For reference: The end-result for the fixed-point mul for the VS.NET compiler is:
int inline FixedPointMul (int a, int b)
{
return (int) __ll_rshift(__emul(a,b),16);
}
The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.
Using Visual C++ 2013 gives the same assembly code for both ways.
gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)
See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)
Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).
Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcount in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.
Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.
Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.
I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.
I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.
Pretty much anytime the compiler sees floating point code, a hand written version will be quicker if you're using an old bad compiler. (2019 update: This is not true in general for modern compilers. Especially when compiling for anything other than x87; compilers have an easier time with SSE2 or AVX for scalar math, or any non-x86 with a flat FP register set, unlike x87's register stack.)
The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion on the subject. Here's an example where the assembly version is twice the speed of the C version (compiled with VS2K5):
#include "stdafx.h"
#include <windows.h>
float KahanSum(const float *data, int n)
{
float sum = 0.0f, C = 0.0f, Y, T;
for (int i = 0 ; i < n ; ++i) {
Y = *data++ - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}
return sum;
}
float AsmSum(const float *data, int n)
{
float result = 0.0f;
_asm
{
mov esi,data
mov ecx,n
fldz
fldz
l1:
fsubr [esi]
add esi,4
fld st(0)
fadd st(0),st(2)
fld st(0)
fsub st(0),st(3)
fsub st(0),st(2)
fstp st(2)
fstp st(2)
loop l1
fstp result
fstp result
}
return result;
}
int main (int, char **)
{
int count = 1000000;
float *source = new float [count];
for (int i = 0 ; i < count ; ++i) {
source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
}
LARGE_INTEGER start, mid, end;
float sum1 = 0.0f, sum2 = 0.0f;
QueryPerformanceCounter (&start);
sum1 = KahanSum (source, count);
QueryPerformanceCounter (&mid);
sum2 = AsmSum (source, count);
QueryPerformanceCounter (&end);
cout << " C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;
return 0;
}
And some numbers from my PC running a default release build*:
C code: 500137 in 103884668
asm code: 500137 in 52129147
Out of interest, I swapped the loop with a dec/jnz and it made no difference to the timings - sometimes quicker, sometimes slower. I guess the memory-limited aspect dwarfs other optimisations. (Editor's note: more likely the FP latency bottleneck is enough to hide the extra cost of the loop instruction. Doing two Kahan summations in parallel for the odd/even elements, and adding those at the end, could maybe speed this up by a factor of 2.)
Whoops, I was running a slightly different version of the code and it outputted the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.
Without giving any specific example or profiler evidence, you can write better assembler than the compiler when you know more than the compiler.
In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions faster than a human can, and so on. It's basically the same as a computer being as good as or better than the best human player at board games, simply because it can search the problem space faster than most humans. Although you could theoretically perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will almost certainly outperform you if you try to write more than a few routines in assembler).
On the other hand, there are cases where the compiler does not have as much information - I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example probably being device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could do.
Others have mentioned special purpose instructions, which is what I'm talking about in the paragraph above - instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.
In my job, there are three reasons for me to know and use assembly. In order of importance:
Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.
Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:
for (int y = 0; y < imageHeight; y++) {
    for (int x = 0; x < imageWidth; x++) {
        // do something
    }
}
the "do something" part typically happens several million times (i.e., between 3 and 30 million). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop, etc.). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.
Doing something the language won't let me. These include getting at the processor architecture and specific processor features, accessing CPU flags that C doesn't expose (man, I really wish C gave you access to the carry flag), etc. I do this maybe once every year or two.
Only when using some special purpose instruction sets the compiler doesn't support.
To maximize the computing power of a modern CPU with multiple pipelines and predictive branching, you need to structure the assembly program in a way that makes it (a) almost impossible for a human to write and (b) even more impossible to maintain.
Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.
Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit, 64-bit data, there are a few mathematical operations not supported by C which can often be performed elegantly in certain assembly instruction sets:
Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C say that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:
#include <stdint.h>   /* int16_t, int32_t */

int16_t x, y;
// set x and y to something
int16_t prod = (int16_t)(((int32_t)x * y) >> 16);
In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16 multiply. Or it may be stupid and require a library call to do the 32x32 multiply, which is way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself.
Certain bitshifting operations (rotation/carries):
// 256-bit array shifted right in its entirety:
uint8_t x[32];
for (int i = 32; --i > 0; )
{
    x[i] = (x[i] >> 1) | (x[i-1] << 7);
}
x[0] >>= 1;
This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right by one bit through the carry flag, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.
For another example, there are linear feedback shift registers (LFSRs) that are elegantly performed in assembly: take a chunk of N bits (8, 16, 32, 64, 128, etc.), shift the whole thing right by 1 (see the algorithm above), and then, if the resulting carry is 1, XOR in a bit pattern that represents the polynomial.
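A rough C sketch of one Galois-style LFSR step, with an arbitrary illustrative polynomial (the value is mine, not from the text); in assembly the shift and conditional XOR map directly onto a shift-through-carry followed by a branch or carry-mask trick:
#include <stdint.h>

/* One right-shift step of a 32-bit Galois LFSR; 0xB4BCD35C is only an
   example feedback polynomial. */
static uint32_t lfsr_step(uint32_t state)
{
    uint32_t carry = state & 1u;   /* the bit that falls off the right */
    state >>= 1;
    if (carry)
        state ^= 0xB4BCD35Cu;      /* XOR in the polynomial bit pattern */
    return state;
}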
Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.
edit: 3. Overflow detection is possible in assembly (it can't easily be done portably in C); this makes some algorithms much easier.
Short answer? Sometimes.
Technically every abstraction has a cost and a programming language is an abstraction for how the CPU works. C however is very close. Years ago I remember laughing out loud when I logged onto my UNIX account and got the following fortune message (when such things were popular):
"The C Programming Language -- A language which combines the flexibility of assembly language with the power of assembly language."
It's funny because it's true: C is like portable assembly language.
It's worth noting that assembly language just runs however you write it. There is however a compiler in between C and the assembly language it generates and that is extremely important because how fast your C code is has an awful lot to do with how good your compiler is.
When gcc came on the scene, one of the things that made it so popular was that it was often so much better than the C compilers that shipped with many commercial UNIX flavours. Not only was it ANSI C (none of this K&R C rubbish), it was more robust and typically produced better (faster) code. Not always, but often.
I tell you all this because there is no blanket rule about the speed of C and assembler because there is no objective standard for C.
Likewise, assembler varies a lot depending on what processor you're running, your system spec, what instruction set you're using and so on. Historically there have been two CPU architecture families: CISC and RISC. The biggest player in CISC was and still is the Intel x86 architecture (and instruction set). RISC dominated the UNIX world (MIPS6000, Alpha, Sparc and so on). CISC won the battle for the hearts and minds.
Anyway, the popular wisdom when I was a younger developer was that hand-written x86 could often be much faster than C because of the way the architecture worked: it had a complexity that benefited from a human doing it. RISC, on the other hand, seemed designed for compilers, so no one (that I knew) wrote, say, Sparc assembler. I'm sure such people existed, but no doubt they've both gone insane and been institutionalized by now.
Instruction sets are an important point even in the same family of processors. Certain Intel processors have extensions like SSE through SSE4. AMD had their own SIMD instructions. The benefit of a programming language like C was someone could write their library so it was optimized for whichever processor you were running on. That was hard work in assembler.
There are still optimizations you can make in assembler that no compiler could make, and a well-written assembler algorithm will be as fast as or faster than its C equivalent. The bigger question is: is it worth it?
Ultimately though assembler was a product of its time and was more popular at a time when CPU cycles were expensive. Nowadays a CPU that costs $5-10 to manufacture (Intel Atom) can do pretty much anything anyone could want. The only real reason to write assembler these days is for low level things like some parts of an operating system (even so the vast majority of the Linux kernel is written in C), device drivers, possibly embedded devices (although C tends to dominate there too) and so on. Or just for kicks (which is somewhat masochistic).
I'm surprised no one said this. The strlen() function is much faster if written in assembly! In C, the best thing you can do is
int c;
for(c = 0; str[c] != '\0'; c++) {}
while in assembly you can speed it up considerably:
mov esi, offset string
mov edi, esi
xor ecx, ecx
lp:
mov ax, word ptr [esi]      ; load two bytes: al = [esi], ah = [esi+1]
cmp al, cl
je end_1
cmp ah, cl
je end_2
mov bx, word ptr [esi + 2]  ; load two more bytes: bl = [esi+2], bh = [esi+3]
cmp bl, cl
je end_3
cmp bh, cl
je end_4
add esi, 4
jmp lp
end_4:
inc esi
end_3:
inc esi
end_2:
inc esi
end_1:
mov ecx, esi
sub ecx, edi
The length is in ecx. This compares 4 characters at a time, so it's roughly 4 times faster. And if you also use the high-order words of eax and ebx, it can become 8 times faster than the previous C routine!
A use case which might not apply anymore but for your nerd pleasure: On the Amiga, the CPU and the graphics/audio chips would fight for accessing a certain area of RAM (the first 2MB of RAM to be specific). So when you had only 2MB RAM (or less), displaying complex graphics plus playing sound would kill the performance of the CPU.
In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e. when the bus was free). So by reordering your instructions, making clever use of the CPU cache and the bus timing, you could achieve effects that were simply not possible using any higher-level language, because you had to time every command, even inserting NOPs here and there to keep the various chips out of each other's way.
Which is another reason why the NOP (No Operation - do nothing) instruction of the CPU can actually make your whole application run faster.
[EDIT] Of course, the technique depends on a specific hardware setup. Which was the main reason why many Amiga games couldn't cope with faster CPUs: The timing of the instructions was off.
Point one, which is not the answer:
Even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmer's never-ending quest to know more and therefore be better. It is also useful when stepping into frameworks you don't have the source code to, so you have at least a rough idea of what is going on. It also helps you understand Java bytecode and .NET IL, as they are both similar to assembler.
To answer the question: when you have a small amount of code or a large amount of time. This is most useful in embedded chips, where low chip complexity and poor competition among compilers targeting these chips can tip the balance in favour of humans. Also, for restricted devices you are often trading off code size, memory size and performance in a way that would be hard to instruct a compiler to do. For example: I know this user action is not called often, so I will have small code size and poor performance, but this other function that looks similar is used every second, so I will have a larger code size and faster performance. That is the sort of trade-off a skilled assembly programmer can make.
I would also like to add that there is a lot of middle ground where you can code in C, compile and examine the assembly produced, then either change your C code or tweak and maintain it as assembly.
My friend works on microcontrollers, currently chips for controlling small electric motors. He works in a combination of low-level C and assembly. He once told me of a good day at work where he reduced the main loop from 48 instructions to 43. He is also faced with choices like: the code has grown to fill the 256k chip and the business wants a new feature. Do you:
Remove an existing feature
Reduce the size of some or all of the existing features maybe at the cost of performance.
Advocate moving to a larger chip with a higher cost, higher power consumption and larger form factor.
I would like to add that, as a commercial developer with quite a portfolio of languages, platforms and types of applications, I have never once felt the need to dive into writing assembly. I have, however, always appreciated the knowledge I gained about it, and have sometimes debugged into it.
I know I have answered the question "why should I learn assembler" far more than the original one, but I feel it is a more important question than when it is faster.
So let's try once more.
You should be thinking about assembly when:
working on low-level operating system functions
working on a compiler
working on an extremely limited chip, embedded system, etc.
Remember to compare your assembly to the compiler-generated code to see which is faster/smaller/better.
David.
Matrix operations using SIMD instructions are probably faster than compiler-generated code.
A few examples from my experience:
Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance.
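For instance, with the GCC/Clang __int128 extension mentioned above, getting the high half of a 64x64-bit multiply can be written directly (a sketch; the function name is mine):
#include <stdint.h>

/* High 64 bits of a full 64x64 -> 128-bit multiply; on x86-64 this typically
   compiles down to a single multiply instruction plus a register move. */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}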
Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit you must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; worse, those instructions are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).
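To make the cost concrete, here is a sketch of the portable workaround the paragraph describes: recovering the carry by comparison on every limb (names are illustrative; both operands are assumed to have n limbs):
#include <stdint.h>

/* Portable multiple-precision add without access to the carry flag:
   each carry is recovered by comparing the sum against an operand.
   Returns the final carry out. */
unsigned mp_add(uint64_t *r, const uint64_t *a, const uint64_t *b, int n)
{
    unsigned carry = 0;
    for (int i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        unsigned c1 = (s < carry);     /* carry from adding the old carry */
        s += b[i];
        unsigned c2 = (s < b[i]);      /* carry from adding b[i] */
        r[i] = s;
        carry = c1 | c2;               /* at most one of c1, c2 can be set */
    }
    return carry;
}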
SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them). As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).
On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.
I can't give the specific examples because it was too many years ago, but there were plenty of cases where hand-written assembler could out-perform any compiler. Reasons why:
You could deviate from calling conventions, passing arguments in registers.
You could carefully consider how to use registers, and avoid storing variables in memory.
For things like jump tables, you could avoid having to bounds-check the index.
Basically, compilers do a pretty good job of optimizing, and that is nearly always "good enough", but in some situations (like graphics rendering) where you're paying dearly for every single cycle, you can take shortcuts because you know the code, where a compiler could not because it has to be on the safe side.
In fact, I have heard of some graphics rendering code where a routine, like a line-draw or polygon-fill routine, actually generated a small block of machine code on the stack and executed it there, so as to avoid continual decision-making about line style, width, pattern, etc.
That said, what I want a compiler to do is generate good assembly code for me but not be too clever, and they mostly do that. In fact, one of the things I hate about Fortran is its scrambling the code in an attempt to "optimize" it, usually to no significant purpose.
Usually, when apps have performance problems, it is due to wasteful design. These days, I would never recommend assembler for performance unless the overall app had already been tuned within an inch of its life, still was not fast enough, and was spending all its time in tight inner loops.
Added: I've seen plenty of apps written in assembly language, and the main speed advantage over a language like C, Pascal, Fortran, etc. was because the programmer was far more careful when coding in assembler. He or she is going to write roughly 100 lines of code a day regardless of language, and in a compiled language that's going to equal 300 or 400 instructions.
More often than you think, C needs to do things that seem unnecessary from an assembly coder's point of view, just because the C standards say so.
Integer promotion, for example. If you want to shift a char variable in C, one would usually expect the code to do just that: a single bit shift.
The standards, however, require the compiler to promote (sign-extend) the value to int before the shift and truncate the result back to char afterwards, which might complicate the code depending on the target processor's architecture.
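A small sketch of the rule (the function is hypothetical and nothing here is target-specific):
/* The operand is promoted to int (sign-extended for a signed char) before
   the shift; the cast converts it back. On an 8-bit target this can cost
   widening and truncation work beyond the single-bit shift you'd expect. */
signed char shift_right_one(signed char c)
{
    return (signed char)(c >> 1);   /* the shift itself happens at int width */
}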
You don't actually know whether your well-written C code is really fast if you haven't looked at the disassembly of what the compiler produces. Many times you look at it and see that "well-written" was subjective.
So it's not necessary to write in assembler to get the fastest code ever, but it's certainly worth knowing assembler for that very reason.
Tight loops, like when playing with images, since an image may consist of millions of pixels. Sitting down and figuring out how to make best use of the limited number of processor registers can make a difference. Here's a real-life sample:
http://danbystrom.se/2008/12/22/optimizing-away-ii/
Then, processors often have some esoteric instructions which are too specialized for a compiler to bother with, but which an assembler programmer can on occasion make good use of. Take the XLAT instruction, for example. Really great if you need to do table look-ups in a loop and the table is limited to 256 bytes!
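The C idiom that kind of look-up corresponds to is a plain 256-entry byte translation loop (a sketch; names are mine):
#include <stddef.h>
#include <stdint.h>

/* Translate each byte in place through a 256-byte table - the pattern that
   XLAT (or a compiler's equivalent addressing mode) implements per element. */
void translate_bytes(uint8_t *buf, size_t n, const uint8_t table[256])
{
    for (size_t i = 0; i < n; i++)
        buf[i] = table[buf[i]];
}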
Updated: Oh, come to think of it, what's most crucial when we speak of loops in general: the compiler often has no clue how many iterations will be the common case! Only the programmer knows whether a loop will be iterated MANY times, so that it is beneficial to prepare for the loop with some extra work, or whether it will be iterated so few times that the set-up actually takes longer than the iterations save.
I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual. The reason assembly is often slower is that people who write such slow assembly didn't read the Optimization Manual.
In the good old days of the Intel 80286, each instruction executed in a fixed number of CPU cycles. Since the Pentium Pro, released in 1995, Intel processors have used complex pipelining with Out-of-Order Execution and Register Renaming. Before that, the Pentium, produced in 1993, had U and V pipelines: dual pipelines that could execute two simple instructions in one clock cycle if they didn't depend on one another. However, that was nothing compared with the Out-of-Order Execution and Register Renaming that appeared in the Pentium Pro, an approach that remains practically the same on the most recent Intel processors.
Let me explain Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results; e.g., you should always clear whole registers (with movzx) to remove dependencies on the previous values of the registers you are working with, so they may be renamed internally by the CPU, allowing instructions to execute in parallel or in a different order. Or, on some processors, a false dependency may exist that can also slow things down, like the false dependency of inc/dec on the flags on the Pentium 4, so you may wish to use add eax, 1 instead of inc eax to remove the dependency on the previous state of the flags.
You can read more on Out-of-Order Execution & Register Renaming if time permits. There is plenty of information available on the Internet.
There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution.
Most people are simply not aware of Out-of-Order Execution. Therefore, they write their assembly programs as if for an 80286, expecting their instructions to take a fixed time to execute regardless of the context. At the same time, C compilers are aware of Out-of-Order Execution and generate the code accordingly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.
There are also lots of optimization tips and tricks besides the Out-of-Order Execution. Just read the Optimization Manual above mentioned :-)
However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, suppose you write in assembly: you then need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. A C compiler, by contrast, makes this job a lot simpler, inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros, so to get the same benefit you'd have to manually optimize the same logic in each place to match the constants and available registers you have.
I think the general case when assembler is faster is when a smart assembly programmer looks at the compiler's output and says "this is a critical path for performance and I can write this to be more efficient" and then that person tweaks that assembler or rewrites it from scratch.
It all depends on your workload.
For day-to-day operations, C and C++ are just fine, but there are certain workloads, such as transforms involving video (compression, decompression, image effects, etc.), that pretty much require assembly to be performant.
They also usually involve using CPU-specific instruction set extensions (MMX/SSE/whatever) that are tuned for those kinds of operations.
It might be worth looking at Optimizing Immutable and Purity by Walter Bright. It's not a profiled test, but it shows one good example of the difference between handwritten and compiler-generated ASM. Walter Bright writes optimising compilers, so it might be worth looking at his other blog posts too.
The Linux Assembly HOWTO asks this question and gives the pros and cons of using assembly.
I have a bit-transposition operation that needs to be performed on 192 or 256 bits at every interrupt, which happens every 50 microseconds.
The transposition follows a fixed map (a hardware constraint). In C, it took around 10 microseconds. When I translated it to assembler, taking into account the specific features of this map, specific register caching, and bit-oriented operations, it took less than 3.5 microseconds to perform.
The simple answer... One who knows assembly well (i.e. has the reference manual beside him and takes advantage of every little processor cache and pipeline feature, etc.) is guaranteed to be capable of producing much faster code than any compiler.
However the difference these days just doesn't matter in the typical application.
http://cr.yp.to/qhasm.html has many examples.
One of the possibilities in the CP/M-86 version of PolyPascal (a sibling to Turbo Pascal) was to replace the "use-BIOS-to-output-characters-to-the-screen" facility with a machine language routine which, in essence, was given the x and y coordinates and the string to put there.
This allowed the screen to be updated much, much faster than before!
There was room in the binary to embed machine code (a few hundred bytes) and there was other stuff there too, so it was essential to squeeze as much as possible.
It turns out that since the screen was 80x25, each coordinate could fit in a byte, so both could fit in a two-byte word. This allowed the necessary calculations to be done in fewer bytes, since a single add could manipulate both values simultaneously.
To my knowledge there are no C compilers that can merge multiple values into a register, perform SIMD-style operations on them, and split them out again later (and I don't think the machine instructions would be shorter anyway).
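The packing trick itself can be expressed in C by hand, even though the compiler won't invent it for you; a rough sketch (the type and function names are illustrative):
#include <stdint.h>

/* Pack x (low byte) and y (high byte) into one 16-bit word so a single add
   can step both coordinates at once. Assumes neither byte overflows into
   its neighbour, which holds for an 80x25 text screen. */
typedef uint16_t packed_xy;

static packed_xy pack_xy(uint8_t x, uint8_t y)
{
    return (packed_xy)(x | ((uint16_t)y << 8));
}

static packed_xy step_down_right(packed_xy p)
{
    return (packed_xy)(p + 0x0101u);   /* +1 to x and +1 to y in one add */
}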
One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (explained in detail here):
add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps
Nowadays most compilers express advanced CPU-specific instructions as intrinsics, i.e., functions that get compiled down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform-specific instructions. Visual C++ can also take advantage of the actual architecture you are targeting with the appropriate /ARCH setting.
Given the right programmer, Assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to create a C program where you couldn't take out at least one instruction of the Assembler.
gcc has become a widely used compiler. Its optimizations in general are not that good. Far better than the average programmer writing assembler, but for real performance, not that good. There are compilers that are simply incredible in the code they produce. So as a general answer there are going to be many places where you can go into the output of the compiler and tweak the assembler for performance, and/or simply re-write the routine from scratch.
Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to the code and spend your time allocating registers, optimizing a few spills away and whatnot, the compiler will win every single time. You make your modification to the code, recompile and measure. Repeat if necessary.
Also, you can do a lot on the high-level side. And inspecting the resulting assembly may give the IMPRESSION that the code is crap, while in practice it runs faster than alternatives you would think are quicker. Example:
int y = data[i];
// do some stuff here..
call_function(y, ...);
The compiler will read the data, push it to the stack (spill it) and later read it back from the stack and pass it as an argument. Sounds shite? It might actually be very effective latency compensation and result in a faster runtime.
// optimized version
call_function(data[i], ...); // not so optimized after all..
The idea with the optimized version was that we reduce register pressure and avoid spilling. But in truth, the "shitty" version was faster!
Looking at the assembly code, just counting the instructions and concluding "more instructions, therefore slower" would be a misjudgment.
The thing to pay attention to here is: many assembly experts think they know a lot but know very little. The rules change from one architecture to the next, too. There is no silver-bullet x86 code, for example, that is always the fastest. These days it is better to go by rules of thumb:
memory is slow
cache is fast
try to use the cache better
how often are you going to miss? do you have a latency compensation strategy?
you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
application architecture is important..
.. but it doesn't help when the problem isn't in the architecture
Also, trusting the compiler to magically transform poorly-thought-out C/C++ code into "theoretically optimal" code is wishful thinking. You have to know the compiler and toolchain you use if you care about "performance" at this low level.
C/C++ compilers are generally not very good at reordering sub-expressions, because functions can have side effects, for starters. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options that allow relaxed precision rules, letting the order of operations be changed by the compiler/linker/code generator.
This topic is a bit of a dead end: for most people it's not relevant, and the rest already know what they are doing anyway.
It all boils down to this: "to understand what you are doing", it's a bit different from knowing what you are doing.
