Prefetching data to cache for x86-64 - c

In my application, at one point I need to perform calculations on a large contiguous block of memory data (100s of MBs). What I was thinking was to keep prefetching the part of the block my program will touch in future, so that when I perform calculations on that portion, the data is already in the cache.
Can someone give me a simple example of how to achieve this with gcc? I read _mm_prefetch somewhere, but don't know how to properly use it. Also note that I have a multicore system, but each core will be working on a different region of memory in parallel.

GCC exposes low-level instructions through built-in functions; the one you want here is __builtin_prefetch. However, you should only expect a measurable difference in cases where the access pattern is not easy for the hardware to predict automatically.
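A minimal sketch of how __builtin_prefetch is typically used for the scenario in the question; the function name, prefetch distance and locality hint are illustrative assumptions that would need tuning on the actual machine:

#include <stddef.h>

/* Sum a large contiguous block, software-prefetching a few hundred
   elements ahead.  PREFETCH_DISTANCE is a guess that must be tuned
   by benchmarking on the target machine. */
double sum_block(const double *data, size_t n)
{
    const size_t PREFETCH_DISTANCE = 256;   /* elements, not bytes */
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
            /* 0 = read, 3 = high temporal locality (keep in all cache levels) */
        sum += data[i];
    }
    return sum;
}

Note that a purely sequential scan like this is exactly the pattern hardware prefetchers already handle well, which is the point made below.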

Modern CPUs have pretty good automatic prefetch and you may well find that you do more harm than good if you try to initiate software prefetching. There is most likely a lot more "low hanging fruit" that you can focus on for optimisation if you find that you actually have a performance problem. Prefetch tends to be one of the last things that you might try, when you're desperate for a few more percent throughput.

Related

How to use software prefetch systematically?

After reading the accepted answer in When should we use prefetch? and examples from Prefetching Examples?, I still have a lot of issues with understanding when to actually use prefetch. While those answers provide an example where prefetch is helpful, they do not explain how to discover it in real programs. It looks like random guessing.
In particular, I am interested in the C implementations for intel x86 (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible through GCC's __builtin_prefetch intrinsic. I would like to know:
How can I see that software prefetch can help for my specific program? I imagine that I can collect CPU profiling metrics (e.g. the number of cache misses) with Intel VTune or the Linux perf utility. In that case, what metrics (or relations between them) indicate the opportunity to improve performance with software prefetching?
How can I locate the loads that suffer from cache misses the most?
How can I see the cache level where the misses happen, to decide which prefetch (0, 1, 2) to use?
Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch? As an example, assume that the following loop suffers from cache misses:
for (int i = 0; i < n; i++) {
    // some code
    double x = a[i];
    // some code
}
Should I place the prefetch before or after the load of a[i]? How far ahead should it point, i.e. what m in a[i+m]? Do I need to unroll the loop to make sure that I prefetch only on cache-line boundaries, or will the prefetch be almost free, like a nop, if the data is already in cache? Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
How can I see that software prefetch can help for my specific program?
You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. The raw number of cache misses is not very useful unless you also get the number of loads/stores; there are generic counters for those too, like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2 there are no generic counters (at least on Intel processors) and you need to use specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. Good documentation can be found here.
When the proportion of misses is large, you should try to optimize the code, but this is only a hint. In fact, regarding your application, you can have a lot of cache hits overall but many cache misses in a critical part/time of your application, so those misses can be lost among all the others. This is especially true for L1 cache references, which are massive in scalar code compared to SIMD code. One solution is to profile only a specific portion of your application and use your knowledge of it to investigate in the right direction. Performance counters are not really a tool to automatically find issues in your program, but a tool to help you validate or disprove hypotheses and to give hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.
How can I locate the loads that suffer from cache misses the most?
Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful since you can tell which instructions are responsible for the most cache misses (though it is not always precise in practice). You can use perf record + perf report to get this information (see the previous tutorial for more details).
Note that there are many reasons that can cause cache misses, and only a few cases can be solved by using software prefetching.
How can I see the cache level where the misses happen, to decide which prefetch (0, 1, 2) to use?
This is often difficult to choose in practice and very dependent on your application. Theoretically, the number is a hint telling the processor the intended level of temporal locality of the target cache line (e.g. whether it should be fetched into the L1, L2 or L3 cache). For example, if you know that the data will be read and reused soon, it is a good idea to fetch it into the L1. However, if the L1 is under pressure and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into lower levels of the cache hierarchy. In practice it is a bit complex, since the behaviour may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
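For reference, a sketch of how GCC's locality argument maps (approximately) to the x86 prefetch instructions listed in the question; the exact instruction chosen depends on the compiler and target:

/* Illustrative only: the mapping below is the typical one on x86. */
void prefetch_examples(const double *p)
{
    __builtin_prefetch(p, 0, 0);  /* non-temporal hint, typically prefetchnta */
    __builtin_prefetch(p, 0, 1);  /* low locality,      typically prefetcht2  */
    __builtin_prefetch(p, 0, 2);  /* moderate locality, typically prefetcht1  */
    __builtin_prefetch(p, 0, 3);  /* high locality,     typically prefetcht0  */
    __builtin_prefetch(p, 1, 3);  /* write intent, may use prefetchw if available */
}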
An example of usage is in this question: software prefetching was used to avoid a cache-thrashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.
Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch?
This is clearly the trickiest part. You should prefetch the cache lines sufficiently early for the latency to be significantly reduced, otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes up space in the program, needs to be decoded, and uses load ports that could otherwise execute other (more critical) load instructions. However, if it is issued too early, the cache line may be evicted again before it is used and need to be reloaded...
The usual solution is to write a code like this:
for (int i = 0; i < n; i++) {
    // some code
    const size_t magic_distance_guess = 200;
    __builtin_prefetch(&a[i + magic_distance_guess]);
    double x = a[i];
    // some code
}
Where magic_distance_guess is a value generally set based on benchmarks (or a very deep understanding of the target platform, though practice often shows that even highly-skilled developers fail to find the best value).
The thing is that the latency is very dependent on where the data is coming from and on the target platform. In most cases, developers cannot really know exactly when to do the prefetching unless they work on a single given target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one also has to consider the maintainability of the code and the overhead of the instruction). Not to mention that the built-ins are compiler-dependent, the prefetching intrinsics are architecture-dependent, and there is no standard, portable way to use software prefetching.
Do I need to unroll the loop to make sure that I prefetch only on cache-line boundaries, or will the prefetch be almost free, like a nop, if the data is already in cache?
Yes, prefetch instructions are not free, so it is best to use only one per cache line (any additional prefetch instructions targeting the same cache line are useless).
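A sketch of what that can look like for an array of double (8 elements per 64-byte line); the unroll factor, the distance and the element type are assumptions to tune:

#include <stddef.h>

/* One software prefetch per 64-byte cache line (8 doubles per line).
   The distance (64 lines ahead) and the unroll factor are guesses that
   must be tuned; n is assumed to be a multiple of 8 for brevity. */
void process(const double *a, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        size_t ahead = i + 64 * 8;          /* 64 cache lines ahead */
        if (ahead < n)
            __builtin_prefetch(&a[ahead]);
        for (size_t j = 0; j < 8; j++) {
            double x = a[i + j];
            (void)x;                        /* some code using x */
        }
    }
}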
Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
This is very dependent on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they analyse a pretty large window of instructions. They tend to execute loads as soon as possible so as to avoid misses, and they are often very good at this job.
In your example loop, I expect the hardware prefetcher to do a very good job, and software prefetching would likely be slower on a (relatively recent) mainstream processor.
Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but they tend to be very good nowadays. Additionally, it is often better to guide the hardware prefetchers than to use software prefetch instructions, since the former have a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and most developers) unless you really know what you are doing.
How to use software prefetch systematically?
The quick answer is: don't.
As you correctly analyzed, prefetching is a tricky and advanced optimisation technique that is not portable and rarely useful.
You can use profiling to determine which sections of code form a bottleneck, and use specialized tools such as Valgrind (cachegrind) to try to identify cache misses that could potentially be avoided using compiler builtins.
Don't expect too much from this, but do profile the code to concentrate your optimizing efforts where it can be useful.
Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.

C- Why is for loop pointer indexing faster? [duplicate]

Some years ago I was on a panel that was interviewing candidates for a relatively senior embedded C programmer position.
One of the standard questions that I asked was about optimisation techniques. I was quite surprised that some of the candidates didn't have answers.
So, in the interests of putting together a list for posterity - what techniques and constructs do you normally use when optimising C programs?
Answers to optimisation for speed and size both accepted.
First things first - don't optimise too early. It's not uncommon to spend time carefully optimising a chunk of code only to find that it wasn't the bottleneck that you thought it was going to be. Or, to put it another way "Before you make it fast, make it work"
Investigate whether there's any option for optimising the algorithm before optimising the code. It's easier to gain performance by improving a poor algorithm than by optimising the code, only to throw that work away when you change the algorithm anyway.
And work out why you need to optimise in the first place. What are you trying to achieve? If you're trying, say, to improve the response time to some event, work out whether there is an opportunity to change the order of execution to minimise the time-critical areas. For example, when trying to improve the response to some external interrupt, can you do any preparation in the dead time between events?
Once you've decided that you need to optimise the code, which bit do you optimise? Use a profiler. Focus your attention (first) on the areas that are used most often.
So what can you do about those areas?
minimise condition checking. Checking conditions (eg. terminating conditions for loops) is time that isn't being spent on actual processing. Condition checking can be minimised with techniques like loop-unrolling.
In some circumstances condition checking can also be eliminated by using function pointers. For example if you are implementing a state machine you may find that implementing the handlers for individual states as small functions (with a uniform prototype) and storing the "next state" by storing the function pointer of the next handler is more efficient than using a large switch statement with the handler code implemented in the individual case statements. YMMV.
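A minimal sketch of that pattern (the states, events and handler names are made up for illustration):

typedef enum { EVT_TICK, EVT_BUTTON } event_t;

typedef void (*state_handler)(event_t ev);

static void state_idle(event_t ev);
static void state_running(event_t ev);

/* The "next state" is just a function pointer; dispatching becomes an
   indirect call instead of a switch over a state variable. */
static state_handler current_state = state_idle;

static void state_idle(event_t ev)
{
    if (ev == EVT_BUTTON)
        current_state = state_running;
}

static void state_running(event_t ev)
{
    if (ev == EVT_BUTTON)
        current_state = state_idle;
}

void dispatch(event_t ev)
{
    current_state(ev);   /* no switch, no range check */
}

Whether this beats a switch depends on the target's indirect-branch cost, so measure before committing to it.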
minimise function calls. Function calls usually carry a burden of context saving (eg. writing local variables contained in registers to the stack, saving the stack pointer), so if you don't have to make a call this is time saved. One option (if you're optimising for speed and not space) is to make use of inline functions.
If function calls are unavoidable minimise the data that is being passed to the functions. For example passing pointers is likely to be more efficient than passing structures.
When optimising for speed choose datatypes that are the native size for your platform. For example on a 32bit processor it is likely to be more efficient to manipulate 32bit values than 8 or 16 bit values. (side note - it is worth checking that the compiler is doing what you think it is. I've had situations where I've discovered that my compiler insisted on doing 16 bit arithmetic on 8 bit values with all of the to and from conversions to go with them)
Find data that can be precalculated, and either calculate during initialisation or (better yet) at compile time. For example when implementing a CRC you can either calculate your CRC values on the fly (using the polynomial directly) which is great for size (but dreadful for performance), or you can generate a table of all of the interim values - which is a much faster implementation, to the detriment of the size.
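As an example of the table approach, the classic table-driven CRC-32 (polynomial 0xEDB88320): the 1 KB table is computed once at start-up, and each input byte then costs a single lookup instead of eight shift/XOR steps:

#include <stdint.h>
#include <stddef.h>

static uint32_t crc_table[256];

/* Precompute the 256-entry table once at initialisation. */
void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc_table[i] = c;
    }
}

/* Each byte is now one table lookup plus a shift and an XOR. */
uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t c = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        c = crc_table[(c ^ buf[i]) & 0xFF] ^ (c >> 8);
    return c ^ 0xFFFFFFFFu;
}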
Localise your data. If you're manipulating a blob of data often your processor may be able to speed things up by storing it all in cache. And your compiler may be able to use shorter instructions that are suited to more localised data (eg. instructions that use 8 bit offsets instead of 32 bit)
In the same vein, localise your functions. For the same reasons.
Work out the assumptions that you can make about the operations that you're performing and find ways of exploiting them. For example, on an 8 bit platform, if the only operation that you're doing on a 32 bit value is an increment, you may find that you can do better than the compiler by inlining (or creating a macro) specifically for this purpose, rather than using a normal arithmetic operation.
Avoid expensive instructions - division is a prime example.
The "register" keyword can be your friend (although hopefully your compiler has a pretty good idea about your register usage). If you're going to use "register" it's likely that you'll have to declare the local variables that you want "register"ed first.
Be consistent with your data types. If you are doing arithmetic on a mixture of data types (eg. shorts and ints, doubles and floats) then the compiler is adding implicit type conversions for each mismatch. These are wasted CPU cycles that may not be necessary.
Most of the options listed above can be used as part of normal practice without any ill effects. However if you're really trying to eke out the best performance:
- Investigate where you can (safely) disable error checking. It's not recommended, but it will save you some space and cycles.
- Hand craft portions of your code in assembler. This of course means that your code is no longer portable but where that's not an issue you may find savings here. Be aware though that there is potentially time lost moving data into and out of the registers that you have at your disposal (ie. to satisfy the register usage of your compiler). Also be aware that your compiler should be doing a pretty good job on its own. (of course there are exceptions)
As everybody else has said: profile, profile, profile.
As for actual techniques, one that I don't think has been mentioned yet:
Hot & Cold Data Separation: Staying within the CPU's cache is incredibly important. One way of helping to do this is by splitting your data structures into frequently accessed ("hot") and rarely accessed ("cold") sections.
An example: Suppose you have a structure for a customer that looks something like this:
struct Customer
{
    int ID;
    int AccountNumber;
    char Name[128];
    char Address[256];
};

Customer customers[1000];
Now, let's assume that you want to access the ID and AccountNumber a lot, but not so much the name and address. What you'd do is split it into two:
struct CustomerAccount
{
    int ID;
    int AccountNumber;
    CustomerData *pData;
};

struct CustomerData
{
    char Name[128];
    char Address[256];
};

CustomerAccount customers[1000];
In this way, when you're looping through your "customers" array, each entry is only 12 bytes (with 32-bit pointers), so many more entries fit in the cache. This can be a huge win if you can apply it to situations like the inner loop of a rendering engine.
My favorite technique is to use a good profiler. Without a good profile telling you where the bottleneck lies, no tricks and techniques are going to help you.
The most common techniques I have encountered are:
loop unrolling
loop optimization for better cache prefetch
(i.e. do N operations in M cycles instead of NxM singular operations)
data aligning
inline functions
hand-crafted asm snippets
As for general recommendations, most of them are already sounded:
choose better algos
use profiler
don't optimize if it doesn't give a 20-30% performance boost
For low-level optimization:
START_TIMER/STOP_TIMER macros from ffmpeg (clock-level accuracy for measurement of any code).
Oprofile, of course, for profiling.
Enormous amounts of hand-coded assembly (just do a wc -l on x264's /common/x86 directory, and then remember most of the code is templated).
Careful coding in general; shorter code is usually better.
Smart low-level algorithms, like the 64-bit bitstream writer I wrote that uses only a single if and no else.
Explicit write-combining.
Taking into account important weird aspects of processors, like Intel's cacheline split issue.
Finding cases where one can losslessly or near-losslessly make an early termination, where the early-termination check costs much less than the speed one gains from it.
Inline assembly for tasks that are far better suited to the x86 SIMD unit, such as median calculations (requires a compile-time check for MMX support).
First and foremost, use a better/faster algorithm. There is no point optimizing code that is slow by design.
When optimizing for speed, trade memory for speed: lookup tables of precomputed values, binary trees, write faster custom implementation of system calls...
When trading speed for memory: use in-memory compression
Avoid using the heap. Use obstacks or a pool allocator for identically sized objects. Put small things with short lifetimes onto the stack. alloca still exists.
Premature optimization is the root of all evil!
;)
As my applications usually don't need much CPU time by design, I focus on the size of my binaries on disk and in memory. What I do mostly is look out for statically sized arrays and replace them with dynamically allocated memory where it's worth the additional effort of freeing the memory later. To cut down the size of the binary, I look for big arrays that are initialized at compile time and move the initialization to runtime.
char buf[1024] = { 0, };
/* becomes: */
char buf[1024];
memset(buf, 0, sizeof(buf));
This removes the 1024 zero bytes from the binary's .data section; instead, the buffer is created on the stack at runtime and then filled with zeros.
EDIT: Oh yeah, and I like to cache things. It's not C specific but depending on what you're caching, it can give you a huge boost in performance.
PS: Please let us know when your list is finished, I'm very curious. ;)
If possible, compare with 0, not with arbitrary numbers, especially in loops, because comparison with 0 is often implemented with separate, faster assembler commands.
For example, if possible, write
for (i=n; i!=0; --i) { ... }
instead of
for (i=0; i!=n; ++i) { ... }
Another thing that was not mentioned:
Know your requirements: don't optimize for situations that are unlikely or will never happen; concentrate on the most bang for the buck.
basics/general:
Do not optimize when you have no problem.
Know your platform/CPU...
...know it thoroughly
know your ABI
Let the compiler do the optimization, just help it with the job.
some things that have actually helped:
Opt for size/memory:
Use bitfields for storing bools (see the sketch after this list)
re-use big global arrays by overlaying with a union (be careful)
Opt for speed (be careful):
use precomputed tables where possible
place critical functions/data in fast memory
Use dedicated registers for often-used globals
count down to zero; the zero flag is free
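A minimal sketch of the bitfield idea mentioned above (the flag names are illustrative):

/* Eight boolean flags packed into one word instead of eight separate
   ints/bools; field names are made up for illustration. */
struct flags {
    unsigned int power_on    : 1;
    unsigned int motor_fault : 1;
    unsigned int door_open   : 1;
    unsigned int overtemp    : 1;
    unsigned int comms_up    : 1;
    unsigned int low_battery : 1;
    unsigned int test_mode   : 1;
    unsigned int spare       : 1;
};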
Difficult to summarize ...
Data structures:
Splitting a data structure according to how it is used is extremely important. It is common to see a structure holding data that is only accessed along certain control-flow paths; this can significantly hurt cache utilisation.
Take cache-line size and prefetch rules into account.
Reorder the members of the structure so that your code accesses them sequentially.
Algorithms:
Take time to think about your problem and to find the correct algorithm.
Know the limitations of the algorithm you choose (a radix-sort/quick-sort for 10 elements to be sorted might not be the best choice).
Low level:
On the latest processors it is not recommended to unroll a loop that has a small body. The processor has its own loop-detection mechanism for this and will short-circuit whole sections of its pipeline.
Trust the HW prefetcher. Of course if your data structures are well designed ;)
Care about your L2 cache line misses.
Try to reduce the local working set of your application as much as possible, as processors are leaning towards smaller caches per core (a Core 2 Duo enjoyed up to 3 MB per core, whereas a Core i7 provides at most 256 KB per core plus 8 MB shared by all cores on a quad-core die).
The most important of all: measure early, measure often, and never ever make assumptions; base your thinking and optimizations on data retrieved by a profiler (please use PTU).
Another hint: performance is key to the success of an application; it should be considered at design time, and you should have clear performance targets.
This is far from being exhaustive but should provide an interesting base.
These days, the most important things in optimization are:
respecting the cache - try to access memory in simple patterns, and don't unroll loops just for fun. Use arrays instead of data structures with lots of pointer chasing and it'll probably be faster for small amounts of data. And don't make anything too big.
avoiding latency - try to avoid divisions and stuff that's slow if other calculations depend on them immediately. Memory accesses that depend on other memory accesses (ie, a[b[c]]) are bad.
avoiding unpredictability - a lot of if/elses with unpredictable conditions, or conditions that introduce more latency, will really mess you up. There's a lot of branchless math tricks that are useful here, but they increase latency and are only useful if you really need them. Otherwise, just write simple code and don't have crazy loop conditions.
Don't bother with optimizations that involve copy-and-pasting your code (like loop unrolling), or reordering loops by hand. The compiler usually does a better job than you at doing this, but most of them aren't smart enough to undo it.
Collecting profiles of code execution gets you 50% of the way there. The other 50% deals with analyzing these reports.
Further, if you use GCC or VisualC++, you can use "profile guided optimization" where the compiler will take info from previous executions and reschedule instructions to make the CPU happier.
Inline functions! Inspired by the profiling fans here, I profiled an application of mine and found a small function that does some bitshifting on MP3 frames. It makes about 90% of all function calls in my application, so I made it inline and voila - the program now uses half of the CPU time it did before.
On most of the embedded systems I have worked on there were no profiling tools, so it's nice to say "use a profiler" but not very practical.
The first rule of speed optimization is: find your critical path.
Usually you will find that this path is not so long and not so complex. It's hard to say in a generic way how to optimize it; it depends on what you are doing and what is in your power to do. For example, you usually want to avoid memcpy on the critical path, so you may need to use DMA or optimize it; but what if your hardware does not have DMA? Check whether the memcpy implementation is the best one and, if not, rewrite it.
Do not use dynamic allocation at all in embedded code, but if you do for some reason, don't do it on the critical path.
Organize your thread priorities correctly; what "correctly" means is the real question, and it's clearly system specific.
We use very simple tools to analyze the bottlenecks: a simple macro that stores a timestamp and an index. A few (2-3) runs will, in 90% of cases, find where you spend your time.
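Such a macro can be as simple as the sketch below (names are illustrative; on a bare-metal target you would read a hardware timer register instead of the POSIX clock):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Crude instrumentation in the spirit described above: record how long
   a marked section takes and print it with an index so repeated runs
   can be compared. */
#define TIME_SECTION(idx, stmt)                                          \
    do {                                                                 \
        struct timespec t0, t1;                                          \
        clock_gettime(CLOCK_MONOTONIC, &t0);                             \
        stmt;                                                            \
        clock_gettime(CLOCK_MONOTONIC, &t1);                             \
        int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL     \
                   + (t1.tv_nsec - t0.tv_nsec);                          \
        printf("section %d: %lld ns\n", (idx), (long long)ns);           \
    } while (0)

/* Usage: TIME_SECTION(3, process_frame(buf, len)); */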
And the last one is code review, a very important one. In most cases we catch performance problems during code review; it is a very effective way :)
Measure performance.
Use realistic and non-trivial benchmarks. Remember that "everything is fast for small N".
Use a profiler to find hotspots.
Reduce number of dynamic memory allocations, disk accesses, database accesses, network accesses, and user/kernel transitions, because these often tend to be hotspots.
Measure performance.
In addition, you should measure performance.
Sometimes you have to decide whether it is more space or more speed that you are after, which will lead to almost opposite optimizations. For example, to get the most out of your space, you pack structures (e.g. #pragma pack(1)) and use bit fields in structures. For more speed, you align to match the processor's preferences and avoid bitfields.
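A rough sketch of the two directions (field names and sizes are illustrative):

#include <stdint.h>

/* Space-optimized layout: packed, bitfields; about 6 bytes per record
   with GCC. */
#pragma pack(push, 1)
struct record_small {
    uint32_t id;
    uint8_t  flags : 4;
    uint8_t  type  : 4;
    uint8_t  len;
};
#pragma pack(pop)

/* Speed-optimized layout: natural alignment, no bitfields, whole words
   the processor can load directly (12 bytes). */
struct record_fast {
    uint32_t id;
    uint32_t flags;
    uint32_t len;
};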
Another trick is picking the right re-sizing algorithms for growing arrays via realloc, or better still writing your own heap manager based on your particular application. Don't assume the one that comes with the compiler is the best possible solution for every application.
If someone doesn't have an answer to that question, it could be they don't know much.
It could also be that they know a lot. I know a lot (IMHO :-), and if I were asked that question, I would be asking you back: Why do you think that's important?
The problem is, any a-priori notions about performance, if they are not informed by a specific situation, are guesses by definition.
I think it is important to know coding techniques for performance, but I think it is even more important to know not to use them, until diagnosis reveals that there is a problem and what it is.
Now I'm going to contradict myself and say, if you do that, you learn how to recognize the design approaches that lead to trouble so you can avoid them, and to a novice, that sounds like premature optimization.
To give you a concrete example, this is a C application that was optimized.
Great lists. I will just add one tip I didn't see in the above lists that in some cases can yield a huge optimisation for minimal cost.
bypass the linker
If you have an application divided into two files, say main.c and lib.c, in many cases you can just add #include "lib.c" to your main.c. That completely bypasses the linker and allows much more efficient optimisation by the compiler.
The same effect can be achieved optimizing dependencies between files, but the cost of changes is usually higher.
Sometimes Google is the best algorithm optimization tool. When I have a complex problem, a bit of searching reveals some guys with PhD's have found a mapping between this and a well-known problem and have already done most of the work.
I would recommend optimizing by using more efficient algorithms, and not as an afterthought: code it that way from the start. Let the compiler work out the details of the small things, as it knows more about the target processor than you do.
For one, I rarely use loops to look things up; I add items to a hashtable and then use the hashtable to look up the results.
For example, you have a string to look up against 50 possible values. Instead of doing 50 strcmps, you add all 50 strings to a hashtable and give each a unique number (you only have to do this once). Then you look up the target string in the hashtable and have one large switch with all 50 cases (or use function pointers).
When looking up things with common sets of input (like CSS rules), I use fast code to keep track of the only possible solutions and then iterate through those to find a match. Once I have a match, I save the result into a hashtable (as a cache) and use the cached result if I get the same input set later.
My main tools for faster code are:
hashtable - for quick lookups and for caching results
qsort - it's the only sort I use
bsp - for looking up things based on area ( map rendering etc )

Profiling a Single Function Predictably

I need a better way of profiling numerical code. Assume that I'm using GCC in Cygwin on 64 bit x86 and that I'm not going to purchase a commercial tool.
The situation is this. I have a single function running in one thread. There are no code dependencies or I/O beyond memory accesses, with the possible exception of some math libraries linked in. But for the most part, it's all table look-ups, index calculations, and numerical processing. I've cache aligned all arrays on the heap and stack. Due to the complexity of the algorithm(s), loop unrolling, and long macros, the assembly listing can become quite lengthy -- thousands of instructions.
I have been resorting to using either the tic/toc timer in Matlab, the time utility in the bash shell, or the time stamp counter (rdtsc) read directly around the function. The problem is this: the variance of the timing (which might be as much as 20% of the runtime) is larger than the size of the improvements I'm making, so I have no way of knowing if the code is better or worse after a change. You might think then it's time to give up. But I would disagree. If you are persistent, many incremental improvements can lead to a two or three times performance increase.
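For reference, a common way to read the time stamp counter around the function with some serialization, using GCC's rdtsc/rdtscp intrinsics (a sketch; it reduces instruction-reordering effects but does nothing about OS or frequency-scaling noise):

#include <stdint.h>
#include <x86intrin.h>

/* rdtscp waits for prior instructions to retire before reading the
   counter; the fences keep unrelated work out of the measured window. */
static inline uint64_t cycles_begin(void)
{
    _mm_lfence();
    return __rdtsc();
}

static inline uint64_t cycles_end(void)
{
    unsigned int aux;
    uint64_t t = __rdtscp(&aux);   /* aux receives the core/socket ID */
    _mm_lfence();
    return t;
}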
One problem I have had multiple times that is particularly maddening is that I make a change and the performance seems to improve consistently by say 20%. The next day, the gain is lost. Now it's possible I made what I thought was an innocuous change to the code and then completely forgot about it. But I'm wondering if it's possible something else is going on. Like maybe GCC doesn't yield a 100% deterministic output as I believe it does. Or maybe it's something simpler, like the OS moved my process to a busier core.
I have considered the following, but I don't know if any of these ideas are feasible or make any sense. If yes, I would like explicit instructions on how to implement a solution. The goal is to minimize the variance of the runtime so I can meaningfully compare different versions of optimized code.
Dedicate a core of my processor to run only my routine.
Direct control over the cache(s) (load it up or clear it out).
Ensuring my dll or executable always loads to the same place in memory. My thinking here is that maybe the set-associativity of the cache interacts with the code/data location in RAM to alter performance on each run.
Some kind of cycle accurate emulator tool (not commercial).
Is it possible to have a degree of control over context switches? Or does it even matter? My thinking is the timing of the context switches is causing variability, maybe by causing the pipeline to be flushed at an inopportune time.
In the past I have had success on RISC architectures by counting instructions in the assembly listing. This only works, of course, if the number of instructions is small. Some compilers (like TI's Code Composer for the C67x) will give you a detailed analysis of how it's keeping the ALU busy.
I haven't found the assembly listings produced by GCC/GAS to be particularly informative. With full optimization on, code is moved all over the place. There can be multiple location directives for a single block of code dispersed about the assembly listing. Further, even if I could understand how the assembly maps back into my original code, I'm not sure there's much correlation between instruction count and performance on a modern x86 machine anyway.
I made a weak attempt at using gcov for line-by-line profiling, but due to an incompatibility between the version of GCC I built and the MinGW compiler, it wouldn't work.
One last thing you can do is average over many, many trial runs, but that takes forever.
EDIT (RE: Call Stack Sampling)
The first question I have is, practically, how do I do this? In one of your power point slides, you showed using Visual Studio to pause the program. What I have is a DLL compiled by GCC with full optimizations in Cygwin. This is then called by a mex DLL compiled by Matlab using the VS2013 compiler.
The reason I use Matlab is because I can easily experiment with different parameters and visualize the results without having to write or compile any low level code. Further, I can compare my optimized DLL to the high level Matlab code to ensure my optimizations have not broken anything.
The reason I use GCC is that I have a lot more experience with it than with Microsoft's compiler. I'm familiar with many flags and extensions. Further, Microsoft has been reluctant, at least in the past, to maintain and update the native C compiler (C99). Finally, I've seen GCC kick the pants off commercial compilers, and I've looked at the assembly listing to see how it's actually done. So I have some intuition of how the compiler actually thinks.
Now, with regards to making guesses about what to fix. This isn't really the issue; it's more like making guesses about how to fix it. In this example, as is often the case in numerical algorithms, there is really no I/O (excluding memory). There are no function calls. There's virtually no abstraction at all. It's like I'm sitting on top of a piece of saran wrap. I can see the computer architecture below, and there's really nothing in-between. If I re-rolled up all the loops, I could probably fit the code on about one page or so, and I could almost count the resultant assembly instructions. Then I could do a rough comparison to the theoretical number of operations a single core is capable of doing to see how close to optimal I am. The trouble then is I lose the auto-vectorization and instruction level parallelization I got from unrolling. Unrolled, the assembly listing is too long to analyze in this way.
The point is that there really isn't much to this code. However, due to the incredible complexity of the compiler and modern computer architecture, there is quite a bit of optimization to be had even at this level. But I don't know how small changes are going to affect the output of the compiled code. Let me give a couple of examples.
This first one is somewhat vague, but I'm sure I've seen it happen a few times. You make a small change and get a 10% improvement. You make another small change and get another 10% improvement. You undo the first change and get another 10% improvement. Huh? Compiler optimizations are neither linear, nor monotonic. It's possible, the second change required an additional register, which broke the first change by forcing the compiler to alter its register allocation algorithm. Maybe, the second optimization somehow occluded the compiler's ability to do optimizations which was fixed by undoing the first optimization. Who knows. Unless the compiler is introspective enough to dump its full analysis at every level of abstraction, you'll never really know how you ended up with the final assembly.
Here is a more specific example which happened to me recently. I was hand coding AVX intrinsics to speed up a filter operation. I thought I could unroll the outer loop to increase instruction level parallelism. So I did, and the result was that the code was twice as slow. What happened was there were not enough 256 bit registers to go around. So the compiler was temporarily saving results on the stack, which killed performance.
As I was alluding to in this post, which you commented on, it's best to tell the compiler what you want, but unfortunately, you often have no choice and are forced to hand tweak optimizations, usually via guess and check.
So I guess my question would be, in these scenarios (the code is effectively small until unrolled, each incremental performance change is small, and you're working at a very low level of abstraction), would it be better to have "precision of timing" or is call stack sampling better at telling me which code is superior?
I faced a similar problem some time ago, but that was on Linux, which made it easier to tweak. Basically the noise introduced by the OS (called "OS jitter") was as big as 5-10% in SPEC2000 tests (I can imagine it's much higher on Windows due to the much bigger amount of bloatware).
I was able to bring deviation to below 1% by combination of the following:
disable dynamic frequency scaling (better do this both in BIOS and in Linux kernel as not all kernel versions do this reliably)
disable memory prefetching and other fancy settings like "Turbo boost", etc. (BIOS, again)
disable hyperthreading
enable high-performance process scheduler in kernel
bind process to core to prevent thread migration (use core 0 - for some reason it was more reliable on my kernel, go figure)
boot to single-user mode (in which no services are running) - this isn't as easy in modern systemd-based distros
disable ASLR
disable network
drop OS pagecache
There may be more to it but 1% noise was good enough for me.
I might put detailed instructions to github later today if you need them.
-- EDIT --
I've published my benchmarking script and instructions here.
Am I right that what you're doing is making an educated guess of what to fix, fixing it, and then trying to measure to see if it made any difference?
I do it a different way, which works especially well as the code gets large.
Rather than guess (which I certainly can) I let the program tell me how the time is spent, by using this method.
If the method tells me that roughly 30% is spent doing such-and-so, I can concentrate on finding a better way to do that.
Then I can run it and just time it.
I don't need a lot of precision.
If it's better, that's great.
If it's worse, I can undo the change.
If it's about the same, I can say "Oh well, maybe it didn't save much, but let's do it all again to find another problem,"
I need not worry.
If there's a way to speed up the program, this will pinpoint it.
And often the problem is not just a simple statement like "line or routine X spends Y% of the time", but "the reason it's doing that is Z in certain cases" and the actual fix may be elsewhere.
After fixing it, the process can be done again, because a different problem, which was small before, is now larger (as a percent, because the total has been reduced by fixing the first problem).
Repetition is the key, because each speedup factor multiplies all the previous, like compound interest.
When the program no longer points out things I can fix, I can be sure it is nearly optimal, or at least nobody else is likely to beat it.
And at no point in this process did I need to measure the time with much precision.
Afterwards, if I want to brag about it in a powerpoint, maybe I'll do multiple timings to get smaller standard error, but even then, what people really care about is the overall speedup factor, not the precision.

what are the steps/strategy to analyze and improve performance of an embedded system

I will break down this question in to sub questions. I am confused if I should ask them separately or in one question. So I will just stick to one SO question.
What are generally the steps to analyze and improve performance of C applications?
Do these steps change if I am developing for an embedded system?
What tools are out there which can help me?
Recently I have been given a task to improve the performance of our product on ARM11 platform. I am relatively new to this field of embedded systems and need gurus here on SO to help me out.
Simply changing compilers can improve your C performance for the same source code by many times over. GCC has not necessarily gotten better for performance over the years; for some programs gcc 3.x produces much tighter code than 4.x. Back when I had access to the tools, ARM's compiler produced significantly better code than gcc, as much as 3 or 4 times faster. LLVM has caught up to GCC 4.x, and I suspect it will pass gcc in terms of performance and overall use for cross-compiling embedded code. Try different versions of gcc, 3.x and 4.x, if you are using gcc. Metaware's compiler and ARM's ADT ran circles around gcc 3.x; gcc 3.x will give gcc 4.x a run for its money with ARM code, for Thumb code gcc 4.x is better, and for Thumb2 (which doesn't apply to you) gcc 4.x is also better. Remember, I have not said a word about changing a single line of code (yet).
LLVM is capable of full-program optimization, in addition to infinitely more tuning knobs than gcc. Despite that, the code generated (ver 27) is only just catching up to the current gcc 4.x in terms of performance for the few programs I tried. And I didn't try the n-factorial number of optimization combinations (optimize on the compile step, different options for each file, or combine two files or three files or all files and optimize those bundles; my theory is to do no optimization on the C-to-bc steps, link all the bc together, then do a single optimization pass on the whole program, then allow the default optimization when llc takes it to the target).
By the same token, simply knowing your compiler and its optimizations can greatly improve the performance of the code without having to change any of it. You have an ARM11; are you compiling for the ARM11 or for generic ARM? You can gain a few percent to a dozen percent by telling the compiler the specific architecture/family (armv6, for example) over the generic armv4 (ARM7) that is often chosen as the default. Know to use -O2 or -O3 if you are brave.
It is often not the case, but switching to Thumb mode can improve performance on specific platforms. It doesn't apply to you, but the Game Boy Advance is a perfect example: it is loaded with non-zero-wait-state 16-bit busses. Thumb has a handful of percent overhead because it takes more instructions to do the same thing, but because the 16-bit fetches are cheaper on that bus, and by taking advantage of some of the sequential-read features of the GBA, Thumb code can run significantly faster than ARM code for the same source code.
Having an ARM11 you probably have an L1 and maybe an L2 cache; are they on? Are they configured? Do you have an MMU, and is your heavily used memory cached? Or are you running zero-wait-state memory, don't need a cache, and should turn it off? In addition to not realizing that you can take the same source code and make it run many times faster by changing compilers or options, folks often don't realize that when you use a cache, simply adding from one up to a few nops in your startup code (as a trick to adjust where code lands in memory by one, two, or a few words) can change your code's execution speed by as much as 10 to 20 percent. Where those cache-line reads hit in heavily used functions/loops makes a big difference. Even saving one cache-line read by adjusting where the code lands is noticeable (cutting it from 3 to 2 or from 2 to 1, for example).
Knowing your architecture, both the processor and your memory environment, is where any tuning should start. Most C libraries, if you are high-level enough to use one (I often don't use a C library, as I run without an operating system and with very limited resources), both in their C code and sometimes with added assembler, make bottleneck routines like memcpy much faster. If your programs operate on aligned 32-bit or, even better, 64-bit addresses, and you adjust things so that every structure/array/memcpy is an integral multiple of 32 or 64 bits, even if it means using a handful of bytes more memory, you will see noticeable improvements (if your code uses structs or copies data in other ways). In addition to getting your structures (if you use them; I certainly don't with embedded code) size-aligned, even if you waste memory, get the elements aligned and consider using 32-bit integers for every element instead of bytes or halfwords. Depending on your memory system this can help (it can hurt too, btw). As with the GBA example above, for specific functions that either by profiling or by intuition you know are not implemented in a manner that takes advantage of your processor, platform or libraries, you may want to turn to assembler, either from scratch or by compiling from C initially, then disassembling and hand-tuning. memcpy is a good example: you may know your system's memory performance and may choose to create your own memcpy specifically for aligned data, copying 64 or 128 or more bits per instruction.
Likewise, mixing global and local variables can make a noticeable performance difference. Traditionally folks are told never to use globals, but in embedded code this isn't necessarily true; it depends on how deeply embedded you are and how much tuning and speed and other factors you are interested in. This is a touchy subject and I may get flamed for it, so I will leave it at that.
The compiler has to burn and evict registers in order to make function calls, plus if you use local variables a stack frame may be required, so function calls are expensive; but at the same time, depending on the code within a function that has now grown in size by avoiding calls, you may create the very problem you were trying to avoid: evicting registers in order to re-use them. Even a single line of C code can make the difference between all the variables in a function fitting in registers and the compiler having to start evicting a bunch of them. For functions or segments of code where you know you need some performance gain, compile and disassemble (and look at register usage and how often the code fetches from or writes to memory). You can and will find places where you need to take a well-used loop and make it its own function, even though the function call has a penalty, because by doing that the compiler can better optimize the loop and not evict/reuse registers, and you get an overall net gain. Even a single extra instruction in a loop that goes around hundreds of times is a measurable performance hit.
Hopefully you already know to absolutely not compile for debug, and to turn all of the compile-for-debug options off. You may already know that code compiled for debug that runs without bugs doesn't mean it is debugged; compiling for debug and using debuggers hides bugs, leaving them as time bombs in your code for your final compile for release. Learn to always compile for release and test with the release version, both for performance and for finding bugs in your code.
Most instruction sets do not have a divide instruction. Avoid using divides or modulo in your code as much as humanly possible; they are performance killers. Naturally this is not the case for powers of two: to save the compiler, and to mentally avoid divides and modulos, try to use shifts and ANDs. Multiplies are easier and more often found in instruction sets, but are still costly. This is a good case for writing assembler to do your multiplies instead of letting the C compiler do it. The ARM multiply is 32 bit * 32 bit = 32 bit, so to do accurate math without overflowing there has to be extra C code wrapped around the multiply; if you already know you won't overflow, burn the registers for a function call and do the multiply in assembler (for the ARM).
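For the power-of-two cases, a minimal sketch of the shift-and-mask replacements (unsigned values assumed; compilers usually do this for constant divisors, but it helps to write it explicitly when you know a runtime divisor is a power of two):

#include <stdint.h>

uint32_t div8(uint32_t x) { return x >> 3; }   /* same as x / 8 for unsigned x */
uint32_t mod8(uint32_t x) { return x & 7u; }   /* same as x % 8 for unsigned x */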
Likewise, most instruction sets do not have a floating-point unit; with yours you might, but even so, avoid float if at all possible. If you have to use float, that is a whole other Pandora's box of performance issues. Most folks don't see the performance problems with code as simple as this:
float a,b;
...
a = b * 7.0;
The rest of the problem is not understanding floating-point accuracy and how good or bad the C libraries are at just trying to get your constants into floating-point form. Again, float is a whole other long discussion on performance problems.
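The issue hinted at above is that 7.0 is a double constant, so b is promoted to double, multiplied in double, and converted back; on a target with only single-precision hardware (or none) that is expensive. A single-precision constant keeps the whole expression in float (a sketch):

float scale(float b)
{
    float a;
    a = b * 7.0;    /* 7.0 is a double: b is promoted, multiplied in double,
                       then the result is converted back to float            */
    a = b * 7.0f;   /* 7.0f keeps the whole expression in single precision   */
    return a;
}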
I am a product of Michael Abrash (I actually have a print copy of Zen of Assembly Language), and the bottom line is: time your code. Come up with an accurate way to time the code. You may think you know where the bottlenecks are and you may think you know your architecture, but by trying different things, even ones you think are wrong, and timing them, you may find, and eventually have to figure out, the error in your thinking. Adding nops to start.S as a final tuning step is a good example of this: all the other work you have done for performance can be instantly erased by not having a good alignment with the cache. This also means re-arranging functions within your source code so that they land in different places in the binary image. I have seen 10 to 20 percent swings of speed increase and decrease as a result of cache-line alignments.
Code Review:
What are good code review techniques?
Static and dynamic analysis of the code.
Tools for static analysis: Sparrow, Prevent, Klocwork
Tools for dynamic analysis: Valgrind, Purify
Gprof allows you to learn where your program spent its time and which functions called which other functions while it was executing.
The steps are the same.
Apart from what is listed in point 1, there are tools like memcheck, etc.
There is a big list here based on platform
Phew!! Quite a big question!
What are generally the steps to analyze and improve performance of C applications?
As well as other static code analysers mentioned here there is a fairly cheap version called PC-Lint which has been around for ages. Sometimes throws up lots of errors and warnings for one error but by the end of it you'll be happy and know waaaaay more about C/C++ because of it.
With all code analysers some of the issues may be more structural to the code so best to start analysing it from day 1 of coding; running analysis on old software may swamp you with issues which may take a while to untangle, best to keep it clean from the beginning.
But code analysers will not catch all logical errors, i.e. code that doesn't do what you want it to do! Those are best caught by code reviews first, then testing. Performance is often improved by trying to keep the algorithms as simple as possible, keeping instructions in loops tight, possibly unrolling loops (your compiler optimisations may do this), and using fast caches when accessing data which is slow to get.
Code reviews can raise a lot of issues from lots of other peoples eyes looking at it. Don't get too many people, try to get 3 other people if possible, sometimes junior developers ask the most insightful questions like, "why are we doing this?".
Testing can be roughly split into two sections, automated and manual. Automated testing requires effort producing test handlers for functions/units, but once written they can be run again and again very quickly. Manual testing requires planning, self-discipline to perform them all to the required standard, imagination to think up scenarios that may impair performance, and you have to be observant (you may have passed the test but the 'scope trace has a bit of an anomaly before/after the test).
"Do these steps change if I am
developing for an embedded system?"
Performance analysis can be different on embedded systems than on applications systems; with the very broad brush that "embedded" now covers, it depends how hardware-centric you are. It can be done using profilers; if you want a cheaper and more cheerful method, then use test output pins to measure sections of code, or measure them with breakpoints on simulators that come with the development environment.
Make sure that not just a typical length of task is measured but also a maximum, as that is where one task may start impeding on other tasks and your scheduled tasks are not completed in time.
What tools are out there which can help me?
Simulators on the IDEs, static analysis tools, dynamic analysis tools, but most of all you and other humans getting the requirements right, decent reviewing (of code and testing) and thorough testing (automated and manual).
Good luck!
My experiences.
Function calls are slow, eliminate with macros or inlined methods. Look at the disassembler listing to see.
If using GCC, mark optimized sections with #pragma GCC optimize("O3") or compile them separately.
Play with different combinations of applying the inline attribute (basically find a balance between size and speed).
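A sketch combining those two ideas (the function and the constants are illustrative; #pragma GCC optimize needs a reasonably recent GCC):

/* Force -O3 for one hot function and always inline a small helper,
   independently of the flags used for the rest of the file. */
static inline __attribute__((always_inline)) int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

#pragma GCC push_options
#pragma GCC optimize ("O3")
void scale_buffer(int *buf, int n, int k)
{
    for (int i = 0; i < n; i++)
        buf[i] = clamp(buf[i] * k, 0, 255);
}
#pragma GCC pop_options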
It is a difficult question to answer briefly, since various techniques have been proposed (such as flowcharts and state diagrams), so you can take a look at some titles:
ARM System-on-Chip Architecture, 2nd Edition -- Steve Furber
ARM System Developer's Guide - Designing and Optimizing System Software -- Andrew N. Sloss, Dominic Symes, Chris Wright & John Rayfield
The Definitive Guide to the ARM Cortex-M3 --Joseph Yiu
C Programming for Embedded Systems --Kirk Zurell
Embedded C -- Michael J. Pont
Programming Embedded Systems in C and C++ --Michael Barr
An Embedded Software Primer --David E. Simon
Embedded Microprocessor Systems 3rd Edition --Stuart Ball
Global Specification and Validation of Embedded Systems - Integrating Heterogeneous Components --G. Nicolescu & A.A Jerraya
Embedded Systems: Modeling, Technology and Applications --Gunter Hommel & Sheng Huanye
Embedded Systems and Computer Architecture --Graham Wilson
Designing Embedded Hardware --John Catsoulis
You have to use a profiler. It will help you identify your application's bottleneck(s). Then focus on improving the functions you spend the most time in and the ones you call the most. Repeat this procedure until you're satisfied with your application performance.
No they don't.
Depending on the platform you're developing on:
Windows : AMD Code Analyst, VTune, Sleepy
Linux : valgrind / callgrind / cachegrind
Mac : the Xcode profiler is quite good.
Try to find a profiler for the architecture you actually work on.

C coding practices for performance or code size - beyond what a compiler does

I'm looking to see what a programmer can do in C that can influence the performance and/or the size of the generated object file.
For e.g,
1. Declaring simple get/set functions as inline may increase performance (at the cost of a larger footprint)
2. For loops that do not use the value of the loop variable itself, count down to zero instead of counting up to a certain value
etc.
It looks like compilers now have advanced to a level where "simple" tricks (like the two points above) are not required at all. Appropriate options during compilation do the job anyway. Heck, I also saw posts here on how compilers handle recursion - that was very interesting! So what are we left to do at a C level then? :)
My specific environment is: GCC 4.3.3 re-targeted for ARM architecture (v4). But responses on other compilers/processors are also welcome and will be munched upon.
PS: This approach of mine goes against the usual "code first!, then benchmark, and finally optimize" approach.
Edit: Just like it so happens, I found a similar post after posting the question: Should we still be optimizing "in the small"?
One thing I can think of that a compiler probably won't optimize is "cache-friendliness": If you're iterating over a two-dimensional array in row-major order, say, make sure your inner loop runs across the column index to avoid cache thrashing. Having the inner loop run over the wrong index can cause a huge performance hit.
This applies to all programming languages, but if you're programming in C, performance is probably critical to you, so it's especially relevant.
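A sketch of the difference (the array dimensions are arbitrary):

#define ROWS 1024
#define COLS 1024

void scale(double grid[ROWS][COLS])
{
    /* Cache-friendly: the inner loop walks the column index, so memory
       is touched sequentially in row-major order. */
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            grid[i][j] *= 2.0;

    /* Cache-hostile: swapping the loops strides COLS * sizeof(double)
       bytes between consecutive accesses and thrashes the cache. */
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            grid[i][j] *= 2.0;
}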
"Always" know the time and space complexity of your algorithms. The compiler will never be able to do that job as well as you can. :)
Compilers these days still aren't very good at vectorizing your code so you'll still want to do the SIMD implementation of most algorithms yourself.
Choosing the right data structures for your exact problem can dramatically increase performance (I've seen cases where moving from a Kd-tree to a BVH would do that, in that specific case).
Compilers might pad some structs/variables to fit into the cache, but other cache optimizations, such as the locality of your data, are still up to you.
Compilers still don't automatically make your code multithreaded, and using OpenMP, in my experience, doesn't really help much (you really have to understand OpenMP anyway to dramatically increase performance). So currently, you're on your own doing multithreading.
To add to what Martin says above about cache-friendliness:
reordering your structures such that fields which are commonly accessed together are in the same cache line can help (for instance by loading just one cache line rather than two). You are essentially increasing the density of useful data in your data cache by doing this. There is a Linux tool which can help you in doing this: dwarves (http://www.linuxinsight.com/files/ols2007/melo-reprint.pdf).
you can use a similar strategy for increasing the density of your code. In gcc you can mark hot and cold branches using likely/unlikely tags. That enables gcc to keep the cold branches separate, which helps increase icache density.
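A sketch of the usual wrappers around GCC's __builtin_expect (the names follow the common likely/unlikely convention; the function is illustrative):

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int parse(const char *buf)
{
    if (unlikely(buf == 0))      /* cold error path, kept off the hot path */
        return -1;
    /* ... hot path ... */
    return 0;
}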
And now for something completely different:
for fields that might be accessed (read and written) across CPUs, the opposite strategy makes sense. The trouble is that for coherence purposes only one CPU can be allowed to write to the same address (in reality the same cache line) at a time. This can lead to a condition called cache-line ping-pong. This is pretty bad, and could be worse if that cache line contains other unrelated data. Here, padding this contended data out to a cache-line length makes sense.
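A sketch of that padding (the 64-byte line size and the per-CPU counter array are assumptions about the target):

#define CACHE_LINE 64

/* Per-CPU counters padded to a full cache line so that two CPUs
   incrementing their own counter never write to the same line. */
struct padded_counter {
    unsigned long count;
    char pad[CACHE_LINE - sizeof(unsigned long)];
} __attribute__((aligned(CACHE_LINE)));

struct padded_counter per_cpu_hits[8];   /* one slot per CPU */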
Note: these clearly are micro-optimizations, to be done only at later stages when you are trying to wring the last bits of performance from your code.
Precomputation where possible... (sorry, but it's not always possible... I did extensive precomputation in my chess engine.) Store those results in memory, keeping the cache in mind: the bigger the precomputed data in memory, the lower the chance of a cache hit. Since most recent hardware is multicore, you can design your application to target it.
If you are using several big arrays, make sure you group them close to each other according to where they are used, boosting cache hits.
Many people are not aware of this: define an inline label (this varies by compiler) which means inline in its original intent; many compilers place the keyword in an entirely different context from the original meaning. There are also ways to increase the inline size limits before the compiler begins popping trivial things out of line. Human-directed inlining can produce much faster code (compilers are often conservative, or do not account for enough of the program), but you need to learn to use it correctly, because it can (easily) be counterproductive. And yes, this absolutely applies to code size as well as speed.

Resources