Should I optimize my code myself or let the compiler/gcc to do it

Should I optimize my code myself or let the compiler/gcc to do it - c

I am writing a c code and I would like to know if making simple operation like multiplication more CPU friendly makes any difference and the code faster. For example, replacing this line of code:
y = x * 15;
with
y = x << 4;
y -= x;
Does the compiler already do that? Should I use the -O2 option in order to make it happen?

There are two parts to the answer:
No, unless you are writing a very specialized function (e.g. a signal processing function that must execute in 20 clocks) you should not optimize; leave that to the compiler. In general your job is to write readable code, and the compiler will (to it's capability optimize it). Note that the optimization will be different for different processors as their hardware (computational capability) can be very different. For example a shift by N instruction (like that in your code) may take N clocks on a processor with a regular shifter, but it will take a single clock (or less) on processors with a hardware barrel shifter.
Yes, most modern optimizing compilers will optimize (for example replace multiplication by shifting where appropriate) without explicit optimization options.
Summarized, optimize only in rare situations when you already know that the compiler didn't do a good job, it is a problem that must be addressed, you know well how to do better than the compiler, and the resulting increased maintenance cost is worth it.

Optimizing code by hand is almost always an exercise in futility these days, especially in higher level languages. While C is almost assembly, modern compilers have a lot more tricks built into them than most people are aware of.
In addition, unless the code you're optimizing is going to be used a lot, i.e., millions of times in close succession, the work of optimizing the code will cost more time than the savings you achieve.
With that said, the only way to see if your code is measurably faster would be to test it: Put each version in a tight loop and execute it a million (or more) times, and see if there's a noticeable difference.
Note that your optimization is for a specific multiplier - any other operand you use it for is going to yield different results. Because it can't be generalized, there's little likelyhood this optimization will be done by any compiler in all cases - and just looking at the code and not knowing what processor architecture it will be run on, I can't say whether it would be faster or not.

Related

Is there a speed difference at execution between writing n following operations or using a for looping from 0 to n?

Let's say I have to do the same operation op on the n+1 elements of an array arr.
Is there a difference of speed at execution between :
op(arr[0]);
op(arr[1]);
...
op(arr[n]);
and
for(i=0; i<=n; i++) {
op(arr[i]);
}
Or is the compiler smart enough to detect the for loop (given n is fixed at compile time) and convert it to the linear execution (if it is actually faster)?

is the compiler smart enough to ....
Yes. Many compilers are far better at such micro optimizations that coders.
This loop or unrolled loop micro-optimization depends on the possessor - some will be faster one way, some the other. Best to code for clarity and let the compiler optimize. If the compiler does not optimize well, more effective to research a new compiler (or its options) than craft code.
Yet for a select situation and with a very knowledgeable coder, crafted code may be better. But that takes time that could be spent with larger efficiency issues.
Problem with crafted code (faster code but looks strange) include: higher maintenance cost, convoluted-ness make for difficult optimization on another platforms, higher bug rate. So the crafted solution may work faster today, yet slower with the next compiler update.

Bitwise operation over a simple piece of code

Recently I came across a code that can compute the largest number given two numbers using XOR. While this looks nifty, the same thing can be achieved by a simple ternary operator or an if else. Not pertaining to just this example, but do bitwise operations have any advantage over normal code? If so, is this advantage in speed of computation or memory usage? I am assuming in bitwise operations the assembly code will look much simpler than normal code. On a related note, while programming embedded systems which is more efficient?
*Normal code refers to how you'd normally do it. For example a*2 is normal code and I can achieve the same thing with a<<1

do bitwise operations have any advantage over normal code?
Bitwise operations are normal code. Most compilers these days have optimizers that generate the same instruction for a << 1 as for a * 2. On some hardware, especially on low-powered microprocessors, shift operations take fewer CPU cycles than multiplication, but there is hardware on which this makes no difference.
In your specific case there is an advantage, though: the code with XOR avoids branching, which has a great potential of speeding up the code. When there is no branching, CPU can use pipelining to perform the same operations much faster.
when programming embedded systems which is more efficient?
Embedded systems often have less powerful CPUs, so bitwise operations do have an advantage. For example, on 68HC11 CPU multiplication takes 10 cycles, while shifting left takes only 3.
Note, however, that it does not mean that you should be using bitwise operations explicitly. Most compilers, including embedded ones, will convert multiplication by a constant to a sequence of shifts and additions in case it saves CPU cycles.

Bitwise operators generally have the advantage of being constant time, regardless of input values. Conditional moves and branches may be the target of timing attacks in certain applications, such as crypto libraries, while bitwise operations are not subject to such attacks. (Disregarding cache timing attacks, etc.)
Generally, if a processor is capable of pipelining, it would be more efficient to use bitwise operations than conditional moves or branches, bypassing the entire branch prediction problem. This may or may not speed up your resulting code.
You do have to be careful, though, as some operations constitute undefined behavior in C, such as shifting signed integers, etc. For this reason, it may be to your advantage to do things the "normal" way.

On some platforms branches are expensive, so finding a way to get the min(x,y) without branching has some merit. I think this is particularly useful in CUDA, where the pipelines in the hardware are long.
Of course, on other platforms (like ARM) with conditional execution and compilers that emit those op-codes, it boils down to a compare and a conditional move (two instructions) with no pipeline bubble. Almost certainly better than a compare, and a few logical operations.

Since the poster asks it with the Embedded tag listed, I will try to reflect primarily that in my answer.
In short, usually you shouldn't try to be "creative" with your coding, since it becomes harder to understand later! (The old saying, "premature optimization is the root of all evils")
So only do anything alike when you know what you are doing, precisely, and in any other case, try to write the most understandable C code.
Well, this was the general part, now lets get on what such tricks could do, how they could affect the execution time.
First thing, in embedded, it is good to check the disassembly listing. If you use a variant of GCC with -O2 optimizations, you can usually assume it is quite clever understanding what the code is meant to do, and will produce the result which is likely fine. It can even use such tricks by itself figuring out the code, if it "sees" that it will be faster on the target CPU, so you don't need to ruin the understandability of your code with tricks. With other compilers, results may vary, in doubt, the assembly listing should be observed to see if execution times could be improved utilizing such bit hack tricks.
On the usual embedded platform, especially at 8 bits, you don't need to care that much about pipeline (and related, branch mispredictions) since it is short (or nonexistent). So you usually gain nothing by eliminating a conditional at the cost of an arithmetic operation, and could actually ruin performance by utilizing some elaborate hacks.
On faster 32 bit CPUs usually there is a longer pipeline and branch predictor to eliminate flushes (costing many cycles), so eliminating conditionals may pay off. But only if they are of such nature that the branch predictor can not guess them right (such as comparisons on "random" data), otherwise the conditionals may still be the better, taking the most minimal time (single cycle or even "less" if the CPU is capable to process more than one operation per cycle) when they were predicted right.

How can I prove or disprove the efficiency of compilation?

This is an unusual question, but I do hope there's a definitive answer.
There's a longstanding debate in our office about how efficiently compilers generate code, specifically number of instructions. We write code for low power embedded systems with virtually no loops. Therefore, the number of instructions emitted is directly proportional to power consumed.
Much of our code looks like this (notice, no dynamic memory allocation, no system calls, very few function calls, very few loops).
foo += 3 * (77 + bar);
if (baz > 18 - qux)
bar -= 19 + 7 >> spam;
I can compile the above snippet with -O3 and read the assembly, but I couldn't write it myself.
The claim I would like to prove or disprove is that compilers generate code that is 2-4X "fatter" (and therefore consume 2-4X times as much power) compared with hand written assembly code.
I'm interested in any compiler with which you have experience.
From this answer I know that GCC and clang can emit assembly interleaved with the C code with
gcc -g -c -Wa,-alh foo.cc
These answers provide solid basis:
When is assembly faster?
Why do you program in assembly?
How can I measure the efficiency with which a compiler generates code?

Hand assembly can always at least match if not beat the compiler, because at the very least, you can start with the compiler generated assembly code and tweak it to make it better. To really do a good job, you need to understand the CPU architecture (pipeline, functional units, memory hierarchy, out-of-order dispatch units, etc.) so that you can schedule each instruction for maximum efficiency.
Another thing to consider is that the number of instructions is not necessarily directly proportional to performance, whether it is speed or power (see Hennessey and Patterson's Computer Architecture: A Quantitative Approach). Basically, you have to look at how many clock cycles each instruction takes, in addition to the number of instructions (and clock rate) to know how long it will take. To know how much energy will be consumed, you also need to know how much energy each instruction takes.
How the CPU implements each instruction affects how many cycles it takes to execute. As an example, your code sequence has a >> operator. The compiler might translate that to a single ASR instruction, but without knowing the architecture, there is no telling how many clock cycles it might take -- some architectures can do an arbitrary shift in a single cycle, while others need one cycle for each bit shift.
Memory access contributes to the number of cycles and power consumption, too. When there are too many variables to store in registers some of them will have to be stored in memory. If you are accessing off chip memory and have a fairly high CPU clock rate, the memory bus can be pretty power hungry. A longer sequence of instructions that avoids reading from and writing to memory (e.g., by computing the same result twice) can be less expensive.
As several others have suggested, there is no substitute for benchmarking. Assuming you are using a microcontroller-based system with a constant input voltage, your best bet is to measure the current draw of your system with each alternative set of code and see which does best (one way would be with a current probe and a digital storage oscilloscope).
Even if you can always write better assembler than the compiler, there is a cost in development time and maintainability. In The Mythical Man Month Brooks estimated 3-5x more effort at time when many, if not most, programmers wrote code in assembler. Unless your code is really tiny, you are probably best off only coding the most critical parts in assembly. Even so, the person writing the assembly should be able to prove that their (more expensive) code is worth the cost by comparing running code vs. running code.

If the question is "how can I measure the efficiency with which a compiler generates code" (your actual question), the answer is "that depends". It depends on how you define "efficiency". Mostly, compilers are designed to optimize for speed. As you change the optimization level (-O1, -O2, -O3), the compiler will spend more time looking for "clever things to do to make it just a bit faster". This can involve loop unrolling, order of execution, use of registers, and many other things.
It seems that your "efficiency" criterion is not one that compilers are designed for: you say you want "fewest cycles" because you think that == lowest power. However I would argue that "fastest execution" == "shortest time before processor can go into standby mode again". Unless you believe that the power consumption of the processor in "awake" mode changes significantly with instructions executed, I think that it is safe to say that fastest execution == shortest time awake == lowest power consumption.
In which case "fat code" doesn't matter - it's back to speed only. Note also that not all instructions take the same number of clock cycles (although to be fair, that depends on the processor).

EDIT, okay that was fun...
Folks that make the blanket statement that compilers outperform humans, are the ones that have not actually checked. Anything a compiler can create a human can create. But a compiler cannot always create the code a human can create. It is that simple. For projects anywhere from a few lines to a few dozen lines or larger, it becomes easier and easier to hand fix the optimizations made by a compiler. Compiler and target help close that gap but there will always be the educated someone that will be able to meet or exceed the compilers output.
The claim I would like to prove or disprove is that compilers generate
code that is 2-4X "fatter" (and therefore consume 2-4X times as much
power) compared with hand written assembly code.
Unless you are defining "fatter" to mean uses that much power. Size of a binary and power consumption are not related. If this whole question/project is related to power consumption, the compiler wont take into account the bios settings you have chosen (assuming you are talking about pcs), the video card, hard disk, monitor, mouse, keyboard, etc, etc. In addition to the processor which is only one (relatively small) part of the equation. And even if it did would someone make a compiler that only makes your code efficient, they cant and wont tune the compiler for every system on the planet. Aint gonna happen.
If you are talking a mobile phone which is a very controlled environment the app may get tuned to save power, but the compiler is not the master of that, it is the user, the compiler does part of it the rest is hand tuned by the programmer.
I can compile the above snippet with -O3 and read the assembly, but I couldn't write it myself.
If you go into this with that kind of attitude then you have automatically failed. Yes you can meet or beat the compiler, period. It is a matter of confidence and will power and time/effort. That statement means you have not really studied the problem, which is why you are asking the question you are asking. Take some time, do some more research, ask detailed questions at stackoverflow (not open ended ones like this one), and with time you will understand what compilers do and dont do and why, in particular why they are not perfect (for any one or many rulers by which that opinion is defined). This question is wholly about opinion and will spark flame wars, and such and will get closed and eventually removed from this site. Instead write and compile and publish code segments and ask questions about "why did the compiler produce this output, why didnt it to [this] instead?" Those kinds of questions have a better chance at real answers and of staying here for others to learn from.

Micro-optimizations in C, which ones are there? Is there anyone really useful? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I understand most of the micro-optimizations out there but are they really useful?
Exempli gratia: does doing ++i instead of i++, or while(1) or for(;;) really result in performance improvements (either in memory fingerprint or CPU cycles)?
So the question is, what micro-optimizations can be done in C? Are they really useful?

You should rely on your compiler to optimise this stuff. Concentrate on using appropriate algorithms and writing reliable, readable and maintainable code.

The day tclhttpd, a webserver written in Tcl, one of the slowest scripting language, managed to outperform Apache, a webserver written in C, one of the supposedly fastest compiled language, was the day I was convinced that micro-optimizations significantly pales in comparison to using a faster algorithm/technique*.
Never worry about micro-optimizations until you can prove in a debugger that it is the problem. Even then, I would recommend first coming here to SO and ask if it is a good idea hoping someone would convince you not to do it.
It is counter-intuitive but very often code, especially tight nested loops or recursion, are optimized by adding code rather than removing them. The gaming industry has come up with countless tricks to speed up nested loops using filters to avoid unnecessary processing. Those filters add significantly more instructions than the difference between i++ and ++i.
*note: We have learned a lot since then. The realization that a slow scripting language can outperform compiled machine code because spawning threads is expensive led to the developments of lighttpd, NginX and Apache2.

There's a difference, I think, between a micro-optimization, a trick, and alternative means of doing something. It can be a micro-optimization to use ++i instead of i++, though I would think of it as merely avoiding a pessimization, because when you pre-increment (or decrement) the compiler need not insert code to keep track of the current value of the variable for use in the expression. If using pre-increment/decrement doesn't change the semantics of the expression, then you should use it and avoid the overhead.
A trick, on the other hand, is code that uses a non-obvious mechanism to achieve a result faster than a straight-forward mechanism would. Tricks should be avoided unless absolutely needed. Gaining a small percentage of speed-up is generally not worth the damage to code readability unless that small percentage reflects a meaningful amount of time. Extremely long-running programs, especially calculation-heavy ones, or real-time programs are often candidates for tricks because the amount of time saved may be necessary to meet the systems performance goals. Tricks should be clearly documented if used.
Alternatives, are just that. There may be no performance gain or little; they just represent two different ways of expressing the same intent. The compiler may even produce the same code. In this case, choose the most readable expression. I would say to do so even if it results in some performance loss (though see the preceding paragraph).

I think you do not need to think about these micro-optimizations because most of them is done by compiler. These things can only make code more difficult to read.
Remember, [edited] premature [/edited] optimization is an evil.

To be honest, that question, while valid, is not relevant today - why?
Compiler writers are a lot more smarter than they were 20 years ago, rewind back in time, then these optimizations would have been very relevant, we were all working with old 80286/386 processors, and coders would often resort to tricks to squeeze even more bytes out of the compiled code.
Today, processors are too fast, compiler writers knows the intimate details of operand instructions to make every thing work, considering that there is pipe-lining, core processors, acres of RAM, remember, with a 80386 processor, there would be 4Mb RAM and if you're lucky, 8Mb was considered superior!!
The paradigm has shifted, it was about squeezing every byte out of compiled code, now it is more on programmer productivity and getting the release out the door much sooner.
The above I have stated the nature of the processor, and compilers, I was talking about the Intel 80x86 processor family, Borland/Microsoft compilers.
Hope this helps,
Best regards,
Tom.

If you can easily see that two different code sequences produce identical results, without making assumptions about the data other than what's present in the code, then the compiler can too, and generally will.
It's only when the transformation from one to the other is highly non-obvious or requires assuming something that you may know to be true but the compiler has no way to infer (eg. that an operation cannot overflow or that two pointers will never alias, even though they aren't declared with the restrict keyword) that you should spend time thinking about these things. Even then, the best thing to do is usually to find a way to inform the compiler about the assumptions that it can make.
If you do find specific cases where the compiler misses simple transformations, 99% of the time you should just file a bug against the compiler and get on with working on more important things.

Keeping the fact that memory is the new disk in mind will likely improve your performance far more than applying any of those micro-optimizations.

For a slightly more pragmatic take on the question of ++i vs. i++ (at least in a C++ context) see http://llvm.org/docs/CodingStandards.html#micro_preincrement.
If Chris Lattner says it, I've got to pay attention. ;-)

You would do better to consider every program you write primarily as a language in which you communicate your ideas, intentions and reasoning to other human beings who will have to bug-fix, reuse and understand it. They will spend more time on decoding garbled code than any compiler or runtime system will do executing it.
To summarise, say what you mean in the clearest way, using the common idioms of the language in question.
For these specific examples in C, for(;;) is the idiom for an infinite loop and "i++" is the usual idiom for "add one to i" unless you use the value in an expression, in which case it depends whether the value with the clearest meaning is the one before or after the increment.

Here's real optimization, in my experience.
Someone on SO once remarked that micro-optimization was like "getting a haircut to lose weight". On American TV there is a show called "The Biggest Loser" where obese people compete to lose weight. If they were able to get their body weight down to a few grams, then getting a haircut would help.
Maybe that's overstating the analogy to micro-optimization, because I have seen (and written) code where micro-optimization actually did make a difference, but when starting off there is a lot more to be gained by simply not solving problems you don't have.

x ^= y
y ^= x
x ^= y

++i should be prefered over i++ for situations where you don't use the return value because it better represents the semantics of what you are trying to do (increment i) rather than any possible optimisation (it might be slightly faster, and is probably not worse).

Generally, loops that count towards zero are faster than loops that count towards some other number. I can imagine a situation where the compiler can't make this optimization for you, but you can make it yourself.
Say that you have and array of length x, where x is some very big number, and that you need to perform some operation on each element of x. Further, let's say that you don't care what order these operations occur in. You might do this...
int i;
for (i = 0; i < x; i++)
doStuff(array[i]);
But, you could get a little optimization by doing it this way instead -
int i;
for (i = x-1; i != 0; i--)
{
doStuff(array[i]);
}
doStuff(array[0]);
The compiler doesn't do it for you because it can't assume that order is unimportant.
MaR's example code is better. Consider this, assuming doStuff() returns an int:
int i = x;
while (i != 0)
{
--i;
printf("%d\n",doStuff(array[i]));
}
This is ok as long as printing the array contents in reverse order is acceptable, but the compiler can't decide that for you.
This being an optimization is hardware dependent. From what I remember about writing assembler (many, many years ago), counting up rather than counting down to zero requires an extra machine instruction each time you go through the loop.
If your test is something like (x < y), then evaluation of the test goes something like this:
subtract y from x, storing the result in some register r1
test r1, to set the n and z flags
branch based on the values of the n and z flags
If your test is ( x != 0), you can do this:
test x, to set the z flag
branch based on the value of the z flag
You get to skip a subtract instruction for each iteration.
There are architectures where you can have the subtract instruction set the flags based on the result of the subtraction, but I'm pretty sure x86 isn't one of them, so most of us aren't using compilers that have access to such a machine instruction.

When is assembly faster than C? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 1 year ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
One of the stated reasons for knowing assembler is that, on occasion, it can be employed to write code that will be more performant than writing that code in a higher-level language, C in particular. However, I've also heard it stated many times that although that's not entirely false, the cases where assembler can actually be used to generate more performant code are both extremely rare and require expert knowledge of and experience with assembly.
This question doesn't even get into the fact that assembler instructions will be machine-specific and non-portable, or any of the other aspects of assembler. There are plenty of good reasons for knowing assembly besides this one, of course, but this is meant to be a specific question soliciting examples and data, not an extended discourse on assembler versus higher-level languages.
Can anyone provide some specific examples of cases where assembly will be faster than well-written C code using a modern compiler, and can you support that claim with profiling evidence? I am pretty confident these cases exist, but I really want to know exactly how esoteric these cases are, since it seems to be a point of some contention.

Here is a real world example: Fixed point multiplies on old compilers.
These don't only come handy on devices without floating point, they shine when it comes to precision as they give you 32 bits of precision with a predictable error (float only has 23 bit and it's harder to predict precision loss). i.e. uniform absolute precision over the entire range, instead of close-to-uniform relative precision (float).
Modern compilers optimize this fixed-point example nicely, so for more modern examples that still need compiler-specific code, see
Getting the high part of 64 bit integer multiplication: A portable version using uint64_t for 32x32 => 64-bit multiplies fails to optimize on a 64-bit CPU, so you need intrinsics or __int128 for efficient code on 64-bit systems.
_umul128 on Windows 32 bits: MSVC doesn't always do a good job when multiplying 32-bit integers cast to 64, so intrinsics helped a lot.
C doesn't have a full-multiplication operator (2N-bit result from N-bit inputs). The usual way to express it in C is to cast the inputs to the wider type and hope the compiler recognizes that the upper bits of the inputs aren't interesting:
// on a 32-bit machine, int can hold 32-bit fixed-point integers.
int inline FixedPointMul (int a, int b)
{
long long a_long = a; // cast to 64 bit.
long long product = a_long * b; // perform multiplication
return (int) (product >> 16); // shift by the fixed point bias
}
The problem with this code is that we do something that can't be directly expressed in the C-language. We want to multiply two 32 bit numbers and get a 64 bit result of which we return the middle 32 bit. However, in C this multiply does not exist. All you can do is to promote the integers to 64 bit and do a 64*64 = 64 multiply.
x86 (and ARM, MIPS and others) can however do the multiply in a single instruction. Some compilers used to ignore this fact and generate code that calls a runtime library function to do the multiply. The shift by 16 is also often done by a library routine (also the x86 can do such shifts).
So we're left with one or two library calls just for a multiply. This has serious consequences. Not only is the shift slower, registers must be preserved across the function calls and it does not help inlining and code-unrolling either.
If you rewrite the same code in (inline) assembler you can gain a significant speed boost.
In addition to this: using ASM is not the best way to solve the problem. Most compilers allow you to use some assembler instructions in intrinsic form if you can't express them in C. The VS.NET2008 compiler for example exposes the 32*32=64 bit mul as __emul and the 64 bit shift as __ll_rshift.
Using intrinsics you can rewrite the function in a way that the C-compiler has a chance to understand what's going on. This allows the code to be inlined, register allocated, common subexpression elimination and constant propagation can be done as well. You'll get a huge performance improvement over the hand-written assembler code that way.
For reference: The end-result for the fixed-point mul for the VS.NET compiler is:
int inline FixedPointMul (int a, int b)
{
return (int) __ll_rshift(__emul(a,b),16);
}
The performance difference of fixed point divides is even bigger. I had improvements up to factor 10 for division heavy fixed point code by writing a couple of asm-lines.
Using Visual C++ 2013 gives the same assembly code for both ways.
gcc4.1 from 2007 also optimizes the pure C version nicely. (The Godbolt compiler explorer doesn't have any earlier versions of gcc installed, but presumably even older GCC versions could do this without intrinsics.)
See source + asm for x86 (32-bit) and ARM on the Godbolt compiler explorer. (Unfortunately it doesn't have any compilers old enough to produce bad code from the simple pure C version.)
Modern CPUs can do things C doesn't have operators for at all, like popcnt or bit-scan to find the first or last set bit. (POSIX has a ffs() function, but its semantics don't match x86 bsf / bsr. See https://en.wikipedia.org/wiki/Find_first_set).
Some compilers can sometimes recognize a loop that counts the number of set bits in an integer and compile it to a popcnt instruction (if enabled at compile time), but it's much more reliable to use __builtin_popcnt in GNU C, or on x86 if you're only targeting hardware with SSE4.2: _mm_popcnt_u32 from <immintrin.h>.
Or in C++, assign to a std::bitset<32> and use .count(). (This is a case where the language has found a way to portably expose an optimized implementation of popcount through the standard library, in a way that will always compile to something correct, and can take advantage of whatever the target supports.) See also https://en.wikipedia.org/wiki/Hamming_weight#Language_support.
Similarly, ntohl can compile to bswap (x86 32-bit byte swap for endian conversion) on some C implementations that have it.
Another major area for intrinsics or hand-written asm is manual vectorization with SIMD instructions. Compilers are not bad with simple loops like dst[i] += src[i] * 10.0;, but often do badly or don't auto-vectorize at all when things get more complicated. For example, you're unlikely to get anything like How to implement atoi using SIMD? generated automatically by the compiler from scalar code.

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.
I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.
I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.

Pretty much anytime the compiler sees floating point code, a hand written version will be quicker if you're using an old bad compiler. (2019 update: This is not true in general for modern compilers. Especially when compiling for anything other than x87; compilers have an easier time with SSE2 or AVX for scalar math, or any non-x86 with a flat FP register set, unlike x87's register stack.)
The primary reason is that the compiler can't perform any robust optimisations. See this article from MSDN for a discussion on the subject. Here's an example where the assembly version is twice the speed as the C version (compiled with VS2K5):
#include "stdafx.h"
#include <windows.h>
float KahanSum(const float *data, int n)
{
float sum = 0.0f, C = 0.0f, Y, T;
for (int i = 0 ; i < n ; ++i) {
Y = *data++ - C;
T = sum + Y;
C = T - sum - Y;
sum = T;
}
return sum;
}
float AsmSum(const float *data, int n)
{
float result = 0.0f;
_asm
{
mov esi,data
mov ecx,n
fldz
fldz
l1:
fsubr [esi]
add esi,4
fld st(0)
fadd st(0),st(2)
fld st(0)
fsub st(0),st(3)
fsub st(0),st(2)
fstp st(2)
fstp st(2)
loop l1
fstp result
fstp result
}
return result;
}
int main (int, char **)
{
int count = 1000000;
float *source = new float [count];
for (int i = 0 ; i < count ; ++i) {
source [i] = static_cast <float> (rand ()) / static_cast <float> (RAND_MAX);
}
LARGE_INTEGER start, mid, end;
float sum1 = 0.0f, sum2 = 0.0f;
QueryPerformanceCounter (&start);
sum1 = KahanSum (source, count);
QueryPerformanceCounter (&mid);
sum2 = AsmSum (source, count);
QueryPerformanceCounter (&end);
cout << " C code: " << sum1 << " in " << (mid.QuadPart - start.QuadPart) << endl;
cout << "asm code: " << sum2 << " in " << (end.QuadPart - mid.QuadPart) << endl;
return 0;
}
And some numbers from my PC running a default release build*:
C code: 500137 in 103884668
asm code: 500137 in 52129147
Out of interest, I swapped the loop with a dec/jnz and it made no difference to the timings - sometimes quicker, sometimes slower. I guess the memory limited aspect dwarfs other optimisations. (Editor's note: more likely the FP latency bottleneck is enough to hide the extra cost of loop. Doing two Kahan summations in parallel for the odd/even elements, and adding those at the end, could maybe speed this up by a factor of 2.)
Whoops, I was running a slightly different version of the code and it outputted the numbers the wrong way round (i.e. C was faster!). Fixed and updated the results.

Without giving any specific example or profiler evidence, you can write better assembler than the compiler when you know more than the compiler.
In the general case, a modern C compiler knows much more about how to optimize the code in question: it knows how the processor pipeline works, it can try to reorder instructions quicker than a human can, and so on - it's basically the same as a computer being as good as or better than the best human player for boardgames, etc. simply because it can make searches within the problem space faster than most humans. Although you theoretically can perform as well as the computer in a specific case, you certainly can't do it at the same speed, making it infeasible for more than a few cases (i.e. the compiler will most certainly outperform you if you try to write more than a few routines in assembler).
On the other hand, there are cases where the compiler does not have as much information - I'd say primarily when working with different forms of external hardware, of which the compiler has no knowledge. The primary example probably being device drivers, where assembler combined with a human's intimate knowledge of the hardware in question can yield better results than a C compiler could do.
Others have mentioned special purpose instructions, which is what I'm talking in the paragraph above - instructions of which the compiler might have limited or no knowledge at all, making it possible for a human to write faster code.

In my job, there are three reasons for me to know and use assembly. In order of importance:
Debugging - I often get library code that has bugs or incomplete documentation. I figure out what it's doing by stepping in at the assembly level. I have to do this about once a week. I also use it as a tool to debug problems in which my eyes don't spot the idiomatic error in C/C++/C#. Looking at the assembly gets past that.
Optimizing - the compiler does fairly well in optimizing, but I play in a different ballpark than most. I write image processing code that usually starts with code that looks like this:
for (int y=0; y < imageHeight; y++) {
for (int x=0; x < imageWidth; x++) {
// do something
}
}
the "do something part" typically happens on the order of several million times (ie, between 3 and 30). By scraping cycles in that "do something" phase, the performance gains are hugely magnified. I don't usually start there - I usually start by writing the code to work first, then do my best to refactor the C to be naturally better (better algorithm, less load in the loop etc). I usually need to read assembly to see what's going on and rarely need to write it. I do this maybe every two or three months.
doing something the language won't let me. These include - getting the processor architecture and specific processor features, accessing flags not in the CPU (man, I really wish C gave you access to the carry flag), etc. I do this maybe once a year or two years.

Only when using some special purpose instruction sets the compiler doesn't support.
To maximize the computing power of a modern CPU with multiple pipelines and predictive branching you need to structure the assembly program in a way that makes it a) almost impossible for a human to write b) even more impossible to maintain.
Also, better algorithms, data structures and memory management will give you at least an order of magnitude more performance than the micro-optimizations you can do in assembly.

Although C is "close" to the low-level manipulation of 8-bit, 16-bit, 32-bit, 64-bit data, there are a few mathematical operations not supported by C which can often be performed elegantly in certain assembly instruction sets:
Fixed-point multiplication: The product of two 16-bit numbers is a 32-bit number. But the rules in C says that the product of two 16-bit numbers is a 16-bit number, and the product of two 32-bit numbers is a 32-bit number -- the bottom half in both cases. If you want the top half of a 16x16 multiply or a 32x32 multiply, you have to play games with the compiler. The general method is to cast to a larger-than-necessary bit width, multiply, shift down, and cast back:
int16_t x, y;
// int16_t is a typedef for "short"
// set x and y to something
int16_t prod = (int16_t)(((int32_t)x*y)>>16);`
In this case the compiler may be smart enough to know that you're really just trying to get the top half of a 16x16 multiply and do the right thing with the machine's native 16x16multiply. Or it may be stupid and require a library call to do the 32x32 multiply that's way overkill because you only need 16 bits of the product -- but the C standard doesn't give you any way to express yourself.
Certain bitshifting operations (rotation/carries):
// 256-bit array shifted right in its entirety:
uint8_t x[32];
for (int i = 32; --i > 0; )
{
x[i] = (x[i] >> 1) | (x[i-1] << 7);
}
x[0] >>= 1;
This is not too inelegant in C, but again, unless the compiler is smart enough to realize what you are doing, it's going to do a lot of "unnecessary" work. Many assembly instruction sets allow you to rotate or shift left/right with the result in the carry register, so you could accomplish the above in 34 instructions: load a pointer to the beginning of the array, clear the carry, and perform 32 8-bit right-shifts, using auto-increment on the pointer.
For another example, there are linear feedback shift registers (LFSR) that are elegantly performed in assembly: Take a chunk of N bits (8, 16, 32, 64, 128, etc), shift the whole thing right by 1 (see above algorithm), then if the resulting carry is 1 then you XOR in a bit pattern that represents the polynomial.
Having said that, I wouldn't resort to these techniques unless I had serious performance constraints. As others have said, assembly is much harder to document/debug/test/maintain than C code: the performance gain comes with some serious costs.
edit: 3. Overflow detection is possible in assembly (can't really do it in C), this makes some algorithms much easier.

Short answer? Sometimes.
Technically every abstraction has a cost and a programming language is an abstraction for how the CPU works. C however is very close. Years ago I remember laughing out loud when I logged onto my UNIX account and got the following fortune message (when such things were popular):
The C Programming Language -- A
language which combines the
flexibility of assembly language with
the power of assembly language.
It's funny because it's true: C is like portable assembly language.
It's worth noting that assembly language just runs however you write it. There is however a compiler in between C and the assembly language it generates and that is extremely important because how fast your C code is has an awful lot to do with how good your compiler is.
When gcc came on the scene one of the things that made it so popular was that it was often so much better than the C compilers that shipped with many commercial UNIX flavours. Not only was it ANSI C (none of this K&R C rubbish), was more robust and typically produced better (faster) code. Not always but often.
I tell you all this because there is no blanket rule about the speed of C and assembler because there is no objective standard for C.
Likewise, assembler varies a lot depending on what processor you're running, your system spec, what instruction set you're using and so on. Historically there have been two CPU architecture families: CISC and RISC. The biggest player in CISC was and still is the Intel x86 architecture (and instruction set). RISC dominated the UNIX world (MIPS6000, Alpha, Sparc and so on). CISC won the battle for the hearts and minds.
Anyway, the popular wisdom when I was a younger developer was that hand-written x86 could often be much faster than C because the way the architecture worked, it had a complexity that benefitted from a human doing it. RISC on the other hand seemed designed for compilers so noone (I knew) wrote say Sparc assembler. I'm sure such people existed but no doubt they've both gone insane and been institutionalized by now.
Instruction sets are an important point even in the same family of processors. Certain Intel processors have extensions like SSE through SSE4. AMD had their own SIMD instructions. The benefit of a programming language like C was someone could write their library so it was optimized for whichever processor you were running on. That was hard work in assembler.
There are still optimizations you can make in assembler that no compiler could make and a well written assembler algoirthm will be as fast or faster than it's C equivalent. The bigger question is: is it worth it?
Ultimately though assembler was a product of its time and was more popular at a time when CPU cycles were expensive. Nowadays a CPU that costs $5-10 to manufacture (Intel Atom) can do pretty much anything anyone could want. The only real reason to write assembler these days is for low level things like some parts of an operating system (even so the vast majority of the Linux kernel is written in C), device drivers, possibly embedded devices (although C tends to dominate there too) and so on. Or just for kicks (which is somewhat masochistic).

I'm surprised no one said this. The strlen() function is much faster if written in assembly! In C, the best thing you can do is
int c;
for(c = 0; str[c] != '\0'; c++) {}
while in assembly you can speed it up considerably:
mov esi, offset string
mov edi, esi
xor ecx, ecx
lp:
mov ax, byte ptr [esi]
cmp al, cl
je end_1
cmp ah, cl
je end_2
mov bx, byte ptr [esi + 2]
cmp bl, cl
je end_3
cmp bh, cl
je end_4
add esi, 4
jmp lp
end_4:
inc esi
end_3:
inc esi
end_2:
inc esi
end_1:
inc esi
mov ecx, esi
sub ecx, edi
the length is in ecx. This compares 4 characters at time, so it's 4 times faster. And think using the high order word of eax and ebx, it will become 8 times faster that the previous C routine!

A use case which might not apply anymore but for your nerd pleasure: On the Amiga, the CPU and the graphics/audio chips would fight for accessing a certain area of RAM (the first 2MB of RAM to be specific). So when you had only 2MB RAM (or less), displaying complex graphics plus playing sound would kill the performance of the CPU.
In assembler, you could interleave your code in such a clever way that the CPU would only try to access the RAM when the graphics/audio chips were busy internally (i.e. when the bus was free). So by reordering your instructions, clever use of the CPU cache, the bus timing, you could achieve some effects which were simply not possible using any higher level language because you had to time every command, even insert NOPs here and there to keep the various chips out of each others radar.
Which is another reason why the NOP (No Operation - do nothing) instruction of the CPU can actually make your whole application run faster.
[EDIT] Of course, the technique depends on a specific hardware setup. Which was the main reason why many Amiga games couldn't cope with faster CPUs: The timing of the instructions was off.

Point one which is not the answer.
Even if you never program in it, I find it useful to know at least one assembler instruction set. This is part of the programmers never-ending quest to know more and therefore be better. Also useful when stepping into frameworks you don't have the source code to and having at least a rough idea what is going on. It also helps you to understand JavaByteCode and .Net IL as they are both similar to assembler.
To answer the question when you have a small amount of code or a large amount of time. Most useful for use in embedded chips, where low chip complexity and poor competition in compilers targeting these chips can tip the balance in favour of humans. Also for restricted devices you are often trading off code size/memory size/performance in a way that would be hard to instruct a compiler to do. e.g. I know this user action is not called often so I will have small code size and poor performance, but this other function that look similar is used every second so I will have a larger code size and faster performance. That is the sort of trade off a skilled assembly programmer can use.
I would also like to add there is a lot of middle ground where you can code in C compile and examine the Assembly produced, then either change you C code or tweak and maintain as assembly.
My friend works on micro controllers, currently chips for controlling small electric motors. He works in a combination of low level c and Assembly. He once told me of a good day at work where he reduced the main loop from 48 instructions to 43. He is also faced with choices like the code has grown to fill the 256k chip and the business is wanting a new feature, do you
Remove an existing feature
Reduce the size of some or all of the existing features maybe at the cost of performance.
Advocate moving to a larger chip with a higher cost, higher power consumption and larger form factor.
I would like to add as a commercial developer with quite a portfolio or languages, platforms, types of applications I have never once felt the need to dive into writing assembly. I have how ever always appreciated the knowledge I gained about it. And sometimes debugged into it.
I know I have far more answered the question "why should I learn assembler" but I feel it is a more important question then when is it faster.
so lets try once more
You should be thinking about assembly
working on low level operating system function
Working on a compiler.
Working on an extremely limited chip, embedded system etc
Remember to compare your assembly to compiler generated to see which is faster/smaller/better.
David.

Matrix operations using SIMD instructions is probably faster than compiler generated code.

A few examples from my experience:
Access to instructions that are not accessible from C. For instance, many architectures (like x86-64, IA-64, DEC Alpha, and 64-bit MIPS or PowerPC) support a 64 bit by 64 bit multiplication producing a 128 bit result. GCC recently added an extension providing access to such instructions, but before that assembly was required. And access to this instruction can make a huge difference on 64-bit CPUs when implementing something like RSA - sometimes as much as a factor of 4 improvement in performance.
Access to CPU-specific flags. The one that has bitten me a lot is the carry flag; when doing a multiple-precision addition, if you don't have access to the CPU carry bit one must instead compare the result to see if it overflowed, which takes 3-5 more instructions per limb; and worse, which are quite serial in terms of data accesses, which kills performance on modern superscalar processors. When processing thousands of such integers in a row, being able to use addc is a huge win (there are superscalar issues with contention on the carry bit as well, but modern CPUs deal pretty well with it).
SIMD. Even autovectorizing compilers can only do relatively simple cases, so if you want good SIMD performance it's unfortunately often necessary to write the code directly. Of course you can use intrinsics instead of assembly but once you're at the intrinsics level you're basically writing assembly anyway, just using the compiler as a register allocator and (nominally) instruction scheduler. (I tend to use intrinsics for SIMD simply because the compiler can generate the function prologues and whatnot for me so I can use the same code on Linux, OS X, and Windows without having to deal with ABI issues like function calling conventions, but other than that the SSE intrinsics really aren't very nice - the Altivec ones seem better though I don't have much experience with them). As examples of things a (current day) vectorizing compiler can't figure out, read about bitslicing AES or SIMD error correction - one could imagine a compiler that could analyze algorithms and generate such code, but it feels to me like such a smart compiler is at least 30 years away from existing (at best).
On the other hand, multicore machines and distributed systems have shifted many of the biggest performance wins in the other direction - get an extra 20% speedup writing your inner loops in assembly, or 300% by running them across multiple cores, or 10000% by running them across a cluster of machines. And of course high level optimizations (things like futures, memoization, etc) are often much easier to do in a higher level language like ML or Scala than C or asm, and often can provide a much bigger performance win. So, as always, there are tradeoffs to be made.

I can't give the specific examples because it was too many years ago, but there were plenty of cases where hand-written assembler could out-perform any compiler. Reasons why:
You could deviate from calling conventions, passing arguments in registers.
You could carefully consider how to use registers, and avoid storing variables in memory.
For things like jump tables, you could avoid having to bounds-check the index.
Basically, compilers do a pretty good job of optimizing, and that is nearly always "good enough", but in some situations (like graphics rendering) where you're paying dearly for every single cycle, you can take shortcuts because you know the code, where a compiler could not because it has to be on the safe side.
In fact, I have heard of some graphics rendering code where a routine, like a line-draw or polygon-fill routine, actually generated a small block of machine code on the stack and executed it there, so as to avoid continual decision-making about line style, width, pattern, etc.
That said, what I want a compiler to do is generate good assembly code for me but not be too clever, and they mostly do that. In fact, one of the things I hate about Fortran is its scrambling the code in an attempt to "optimize" it, usually to no significant purpose.
Usually, when apps have performance problems, it is due to wasteful design. These days, I would never recommend assembler for performance unless the overall app had already been tuned within an inch of its life, still was not fast enough, and was spending all its time in tight inner loops.
Added: I've seen plenty of apps written in assembly language, and the main speed advantage over a language like C, Pascal, Fortran, etc. was because the programmer was far more careful when coding in assembler. He or she is going to write roughly 100 lines of code a day, regardless of language, and in a compiler language that's going to equal 3 or 400 instructions.

More often than you think, C needs to do things that seem to be unneccessary from an Assembly coder's point of view just because the C standards say so.
Integer promotion, for example. If you want to shift a char variable in C, one would usually expect that the code would do in fact just that, a single bit shift.
The standards, however, enforce the compiler to do a sign extend to int before the shift and truncate the result to char afterwards which might complicate code depending on the target processor's architecture.

You don't actually know whether your well-written C code is really fast if you haven't looked at the disassembly of what compiler produces. Many times you look at it and see that "well-written" was subjective.
So it's not necessary to write in assembler to get fastest code ever, but it's certainly worth to know assembler for the very same reason.

Tight loops, like when playing with images, since an image may cosist of millions of pixels. Sitting down and figuring out how to make best use of the limited number of processor registers can make a difference. Here's a real life sample:
http://danbystrom.se/2008/12/22/optimizing-away-ii/
Then often processors have some esoteric instructions which are too specialized for a compiler to bother with, but on occasion an assembler programmer can make good use of them. Take the XLAT instruction for example. Really great if you need to do table look-ups in a loop and the table is limited to 256 bytes!
Updated: Oh, just come to think of what's most crucial when we speak of loops in general: the compiler has often no clue on how many iterations that will be the common case! Only the programmer know that a loop will be iterated MANY times and that it therefore will be beneficial to prepare for the loop with some extra work, or if it will be iterated so few times that the set-up actually will take longer than the iterations expected.

I have read all the answers (more than 30) and didn't find a simple reason: assembler is faster than C if you have read and practiced the Intel® 64 and IA-32 Architectures Optimization Reference Manual, so the reason why assembly may be slower is that people who write such slower assembly didn't read the Optimization Manual.
In the good old days of Intel 80286, each instruction was executed at a fixed count of CPU cycles. Still, since Pentium Pro, released in 1995, Intel processors became superscalar, utilizing Complex Pipelining: Out-of-Order Execution & Register Renaming. Before that, on Pentium, produced in 1993, there were U and V pipelines. Therefore, Pentium introduced dual pipelines that could execute two simple instructions at one clock cycle if they didn't depend on one another. However, this was nothing compared with the Out-of-Order Execution & Register Renaming that appeared in Pentium Pro. This approach introduced in Pentium Pro is practically the same nowadays on most recent Intel processors.
Let me explain the Out-of-Order Execution in a few words. The fastest code is where instructions do not depend on previous results, e.g., you should always clear whole registers (by movzx) to remove dependency from previous values of the registers you are working with, so they may be renamed internally by the CPU to allow instruction execute in parallel or in a different order. Or, on some processors, false dependency may exist that may also slow things down, like false dependency on Pentium 4 for inc/dec, so you may wish to use add eax, 1 instead or inc eax to remove dependency on the previous state of the flags.
You can read more on Out-of-Order Execution & Register Renaming if time permits. There is plenty of information available on the Internet.
There are also many other essential issues like branch prediction, number of load and store units, number of gates that execute micro-ops, memory cache coherence protocols, etc., but the crucial thing to consider is the Out-of-Order Execution.
Most people are simply not aware of the Out-of-Order Execution. Therefore, they write their assembly programs like for 80286, expecting their instructions will take a fixed time to execute regardless of the context. At the same time, C compilers are aware of the Out-of-Order Execution and generate the code correctly. That's why the code of such uninformed people is slower, but if you become knowledgeable, your code will be faster.
There are also lots of optimization tips and tricks besides the Out-of-Order Execution. Just read the Optimization Manual above mentioned :-)
However, assembly language has its own drawbacks when it comes to optimization. According to Peter Cordes (see the comment below), some of the optimizations compilers do would be unmaintainable for large code-bases in hand-written assembly. For example, suppose you write in assembly. In that case, you need to completely change an inline function (an assembly macro) when it inlines into a function that calls it with some arguments being constants. At the same time, a C compiler makes its job a lot simpler—and inlining the same code in different ways into different call sites. There is a limit to what you can do with assembly macros. So to get the same benefit, you'd have to manually optimize the same logic in each place to match the constants and available registers you have.

I think the general case when assembler is faster is when a smart assembly programmer looks at the compiler's output and says "this is a critical path for performance and I can write this to be more efficient" and then that person tweaks that assembler or rewrites it from scratch.

It all depends on your workload.
For day-to-day operations, C and C++ are just fine, but there are certain workloads (any transforms involving video (compression, decompression, image effects, etc)) that pretty much require assembly to be performant.
They also usually involve using CPU specific chipset extensions (MME/MMX/SSE/whatever) that are tuned for those kinds of operation.

It might be worth looking at Optimizing Immutable and Purity by Walter Bright it's not a profiled test but shows you one good example of a difference between handwritten and compiler generated ASM. Walter Bright writes optimising compilers so it might be worth looking at his other blog posts.

LInux assembly howto, asks this question and gives the pros and cons of using assembly.

I have an operation of transposition of bits that needs to be done, on 192 or 256 bits every interrupt, that happens every 50 microseconds.
It happens by a fixed map(hardware constraints). Using C, it took around 10 microseconds to make. When I translated this to Assembler, taking into account the specific features of this map, specific register caching, and using bit oriented operations; it took less than 3.5 microsecond to perform.

The simple answer... One who knows assembly well (aka has the reference beside him, and is taking advantage of every little processor cache and pipeline feature etc) is guaranteed to be capable of producing much faster code than any compiler.
However the difference these days just doesn't matter in the typical application.

http://cr.yp.to/qhasm.html has many examples.

One of the posibilities to the CP/M-86 version of PolyPascal (sibling to Turbo Pascal) was to replace the "use-bios-to-output-characters-to-the-screen" facility with a machine language routine which in essense was given the x, and y, and the string to put there.
This allowed to update the screen much, much faster than before!
There was room in the binary to embed machine code (a few hundred bytes) and there was other stuff there too, so it was essential to squeeze as much as possible.
It turnes out that since the screen was 80x25 both coordinates could fit in a byte each, so both could fit in a two-byte word. This allowed to do the calculations needed in fewer bytes since a single add could manipulate both values simultaneously.
To my knowledge there is no C compilers which can merge multiple values in a register, do SIMD instructions on them and split them out again later (and I don't think the machine instructions will be shorter anyway).

One of the more famous snippets of assembly is from Michael Abrash's texture mapping loop (expained in detail here):
add edx,[DeltaVFrac] ; add in dVFrac
sbb ebp,ebp ; store carry
mov [edi],al ; write pixel n
mov al,[esi] ; fetch pixel n+1
add ecx,ebx ; add in dUFrac
adc esi,[4*ebp + UVStepVCarry]; add in steps
Nowadays most compilers express advanced CPU specific instructions as intrinsics, i.e., functions that get compiled down to the actual instruction. MS Visual C++ supports intrinsics for MMX, SSE, SSE2, SSE3, and SSE4, so you have to worry less about dropping down to assembly to take advantage of platform specific instructions. Visual C++ can also take advantage of the actual architecture you are targetting with the appropriate /ARCH setting.

Given the right programmer, Assembler programs can always be made faster than their C counterparts (at least marginally). It would be difficult to create a C program where you couldn't take out at least one instruction of the Assembler.

gcc has become a widely used compiler. Its optimizations in general are not that good. Far better than the average programmer writing assembler, but for real performance, not that good. There are compilers that are simply incredible in the code they produce. So as a general answer there are going to be many places where you can go into the output of the compiler and tweak the assembler for performance, and/or simply re-write the routine from scratch.

Longpoke, there is just one limitation: time. When you don't have the resources to optimize every single change to code and spend your time allocating registers, optimize few spills away and what not, the compiler will win every single time. You do your modification to the code, recompile and measure. Repeat if necessary.
Also, you can do a lot in the high-level side. Also, inspecting the resulting assembly may give the IMPRESSION that the code is crap, but in practice it will run faster than what you think would be quicker. Example:
int y = data[i];
// do some stuff here..
call_function(y, ...);
The compiler will read the data, push it to stack (spill) and later read from stack and pass as argument. Sounds shite? It might actually be very effective latency compensation and result in faster runtime.
// optimized version
call_function(data[i], ...); // not so optimized after all..
The idea with the optimized version was, that we have reduced register pressure and avoid spilling. But in truth, the "shitty" version was faster!
Looking at the assembly code, just looking at the instructions and concluding: more instructions, slower, would be a misjudgment.
The thing here to pay attention is: many assembly experts think they know a lot, but know very little. The rules change from architecture to next, too. There is no silver-bullet x86 code, for example, which is always the fastest. These days is better to go by rules-of-thumb:
memory is slow
cache is fast
try to use cached better
how often you going to miss? do you have latency compensation strategy?
you can execute 10-100 ALU/FPU/SSE instructions for one single cache miss
application architecture is important..
.. but it does't help when the problem isn't in the architecture
Also, trusting too much into compiler magically transforming poorly-thought-out C/C++ code into "theoretically optimum" code is wishful thinking. You have to know the compiler and tool chain you use if you care about "performance" at this low-level.
Compilers in C/C++ are generally not very good at re-ordering sub-expressions because the functions have side effects, for starters. Functional languages don't suffer from this caveat but don't fit the current ecosystem that well. There are compiler options to allow relaxed precision rules which allow order of operations to be changed by the compiler/linker/code generator.
This topic is a bit of a dead-end; for most it's not relevant, and the rest, they know what they are doing already anyway.
It all boils down to this: "to understand what you are doing", it's a bit different from knowing what you are doing.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight