Is a pointer indirection more costly than a conditional?

Is a pointer indirection (to fetch a value) more costly than a conditional?
I've observed that most decent compilers can precompute a pointer indirection to varying degrees--possibly removing most branching instructions--but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.
I would expect that if the data referenced by the pointer is not in a cache at runtime, the resulting cache miss (and possible eviction of another line) would be expensive, but I don't have any data to back that up.
Does anyone have solid data (or a justifiable opinion) on the matter?
EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.
If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.

This is very much dependent on the circumstances.
1. How often is the data in one of the caches (L1, L2, L3), and how often must it be fetched all the way from RAM?
A fetch from RAM will take around 10-40ns. Of course, that will fill a whole cache-line in little more than that, so if you then use the next few bytes as well, it will definitely not "hurt as bad".
2. What processor is it?
Older Intel Pentium 4 processors were famous for their long pipelines, and would take 25-30 clock cycles (~15 ns at 2 GHz) to recover from a mispredicted branch.
3 How "predictable" is the condition?
Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but it does hurt a little bit.
4 How "busy" and "dirty" is the cache?
If you have to throw out some dirty data to fill the cache-line, it will take another 15-50ns on top of the "fetch the data in" time.
The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.
On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.
And finally, of course, the compiler USUALLY knows quite well what works best... ;)
In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would think that an indirect memory access is faster than a jump, but without seeing what code your source compiles to, it's quite hard to say.

It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.
My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.
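For what it's worth, such a test is easy to sketch in C. The following is only a rough illustration, not code from the question: the array name, sizes and the use of clock() are my own choices, and an optimizing compiler may well turn both loops into branch-free code, so inspect the generated assembly before trusting the numbers.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

int main(void) {
    unsigned char *input = malloc(N);
    if (!input) return 1;
    for (size_t i = 0; i < N; i++)
        input[i] = (unsigned char)rand();      // hard-to-predict branch data

    const int table[2] = { 3, 7 };             // arbitrary values reachable by either path

    clock_t t0 = clock();
    long sum_branch = 0;
    for (size_t i = 0; i < N; i++) {           // version 1: conditional branch
        if (input[i] & 1)
            sum_branch += 3;
        else
            sum_branch += 7;
    }

    clock_t t1 = clock();
    long sum_table = 0;
    for (size_t i = 0; i < N; i++)             // version 2: branch-free table indirection
        sum_table += table[input[i] & 1];
    clock_t t2 = clock();

    printf("branch: %ld in %.3fs, table: %ld in %.3fs\n",
           sum_branch, (double)(t1 - t0) / CLOCKS_PER_SEC,
           sum_table,  (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(input);
    return 0;
}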

It varies from processor to processor, but depending on the data set you're working with, a pipeline flush caused by a mispredicted branch (or, in some cases, badly ordered instructions) can be more damaging to speed than a simple cache miss.
In the PowerPC case, for instance, branches not taken (but predicted to be taken) cost about 22 cycles (the time taken to re-fill the pipeline), while an L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better to not branch and let the processor cache-miss your data at the cost of 3 cycles (branches predicted to be taken and taken) for every set of data you're processing.
It all boils down to: test it yourself. The answer is not definitive for all problems.

Since the processor has to predict the outcome of the condition in order to decide which instructions are most likely to be executed next, I would say that the actual cost of the instructions is not what matters.
Conditional instructions are bad efficiency-wise because they make the program flow unpredictable.

How to use software prefetch systematically?

After reading the accepted answer in When should we use prefetch? and examples from Prefetching Examples?, I still have a lot of issues with understanding when to actually use prefetch. While those answers provide an example where prefetch is helpful, they do not explain how to discover it in real programs. It looks like random guessing.
In particular, I am interested in the C implementations for Intel x86 (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible through GCC's __builtin_prefetch intrinsic. I would like to know:
How can I see that software prefetch can help for my specific program? I imagine that I can collect CPU profiling metrics (e.g. number of cache misses) with Intel VTune or the Linux utility perf. In that case, what metrics (or relations between them) indicate an opportunity to improve performance with software prefetching?
How can I locate the loads that suffer the most from cache misses?
How can I see the cache level where the misses happen, to decide which prefetch (0, 1, 2) to use?
Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch? As an example, assume that the following loop suffers from cache misses:
for (int i = 0; i < n; i++) {
    // some code
    double x = a[i];
    // some code
}
Should I place the prefetch before or after the load a[i]? How far ahead should it point, a[i+m]? Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries, or will it be almost free, like a nop, if the data is already in cache? Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
How can I see that software prefetch can help for my specific program?
You can check the proportion of cache misses. perf or VTune can be used to get this information thanks to hardware performance counters. You can get the list of counters with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. Knowing the number of cache misses is not very useful unless you also know the number of loads/stores; there are generic counters for those too, like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, for the L2 there are no generic counters (at least on Intel processors) and you need to use specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. Good documentation can be found here.
When the proportion of misses is big, then you should try to optimize the code, but this is only a hint. In fact, depending on your application, you can have a lot of cache hits overall but many cache misses in a critical part/time of your application. As a result, those cache misses can get lost among all the others. This is especially true for L1 cache references, which are massive in scalar code compared to SIMD code. One solution is to profile only a specific portion of your application and use your knowledge of it to investigate in the right direction. Performance counters are not really a tool to automatically search for issues in your program, but a tool to assist you in validating/disproving hypotheses or giving hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.
How can I locate the loads that suffer the most from cache misses?
Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful, since you can tell which instructions are responsible for the most cache misses (though it is not always precise in practice). You can use perf record + perf report to get this information (see the previous tutorial for more information).
Note that there are many possible causes of cache misses, and only a few cases can be solved by using software prefetching.
How can I see the cache level where the misses happen, to decide which prefetch (0, 1, 2) to use?
This is often difficult to choose in practice and very dependent on your application. Theoretically, the number is a hint to tell the processor the level of locality of the target cache line (e.g. whether to fetch it into the L1, L2 or L3 cache). For example, if you know that the data should be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is heavily used and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into lower-level caches. In practice it is a bit more complex, since the behavior may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
An example of usage is in this question, where software prefetching was used to avoid a cache-thrashing issue with some specific strides. It is a pathological case where the hardware prefetcher is not very useful.
Assuming I found a particular load that suffers from misses in a specific cache level, where should I place the prefetch?
This is clearly the trickiest part. You should prefetch the cache lines sufficiently early for the latency to be significantly reduced, otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes some space in the program, needs to be decoded, and uses load ports that could otherwise be used to execute other (more critical) load instructions. However, if the prefetch is issued too early, the cache line can be evicted before it is actually used and will need to be reloaded...
The usual solution is to write a code like this:
for (int i = 0; i < n; i++) {
    // some code
    const size_t magic_distance_guess = 200;
    __builtin_prefetch(&a[i + magic_distance_guess]);
    double x = a[i];
    // some code
}
Where magic_distance_guess is a value generally set based on benchmarks (or a very deep understanding of the target platform, though practice often shows that even highly-skilled developers fail to find the best value).
The thing is that the latency is very dependent on where the data is coming from and on the target platform. In most cases, developers cannot really know exactly when to do the prefetching unless they work on a unique, given target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one has to consider the maintainability of the code and the overhead of the instruction). Not to mention that the built-ins are compiler-dependent, prefetching intrinsics are architecture-dependent, and there is no standard, portable way to use software prefetching.
Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries, or will it be almost free, like a nop, if the data is already in cache?
Yes, prefetching instructions are not free, so it is better to issue only one per cache line (any additional prefetch instruction targeting the same cache line is useless).
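As a sketch of that point (the line size, the prefetch distance and the modulo test are assumptions to tune or unroll away, not something from the answer above):
#include <stddef.h>

#define CACHE_LINE 64                       // assumed line size in bytes
#define PREFETCH_AHEAD 16                   // elements ahead; tune by benchmarking

void process(const double *a, size_t n) {
    const size_t per_line = CACHE_LINE / sizeof(double);   // 8 doubles per line
    for (size_t i = 0; i < n; i++) {
        if (i % per_line == 0)              // one prefetch per cache line
            __builtin_prefetch(&a[i + PREFETCH_AHEAD]);
        // ... do the real work on a[i] here ...
    }
}
In practice you would unroll the loop by per_line (or advance a separate prefetch pointer) so the modulo test and its branch disappear.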
Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
This is very dependent on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they analyze a pretty huge window of instructions. They tend to execute loads as soon as possible so as to avoid misses, and they are often very good at this job.
In your example loop, I expect the hardware prefetcher to do a very good job, and using software prefetching should actually be slower on a (relatively recent) mainstream processor.
Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but they tend to be very good nowadays. Additionally, it is often better to guide the hardware prefetchers than to use software prefetching instructions, since the former has a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and by most developers) unless you really know what you are doing.
How to use software prefetch systematically?
The quick answer is: don't.
As you correctly analyzed, prefetching is a tricky and advanced optimisation technique that is not portable and rarely useful.
You can use profiling to determine what sections of code form a bottleneck and use specialized tools such as valgrind to try and identify cache misses that could potentially be avoided using compiler builtins.
Don't expect too much from this, but do profile the code to concentrate your optimizing efforts where it can be useful.
Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.

Does Rust's array bounds checking affect performance?

I'm coming from C and I wonder whether Rust's bounds checking affects performance. It probably needs some additional assembly instructions for every access, which could hurt when processing lots of data.
On the other hand, the costly thing in processor performance is memory, so a few extra arithmetic instructions might not hurt; then again, it might matter that once a cache line is loaded, sequential access to it should be very fast.
Has somebody benchmarked this?
Unfortunately, the cost of a bounds check is not a straightforward thing to estimate. It's certainly not "one cycle per check", or any such easy to guess cost. It will have nonzero impact, but it might be insignificant.
In theory, it would be possible to measure the cost of bounds checking on basic types like Vec by modifying Rust to disable them and running a large-scale ecosystem test. This would give some kind of rule of thumb, but without doing it, it's quite hard to know whether this will be closer to a ten percent or a tenth of a percent overhead.
There are some ways you can do better than timing and guessing, though. These rules of thumb apply mostly to desktop-class hardware; lower end hardware or something that targets a different niche will have different characteristics.
If your indices are derived from the container size, there is a good chance that the compiler will be able to eliminate the bounds checks entirely. At that point, the only cost of the bounds checks in a release build is that they occasionally get in the way of other optimizations, which could, but normally doesn't, hurt.
If your code is branchy, memory access heavy or otherwise hard to optimise, and the bounds to check are easy to access, there is a good chance that bounds checking will manage to happen mostly in the CPU's spare bandwidth, with branch prediction helping out specifically, in which case the overall cost will be particularly small, especially compared to the cost of the rest of the code.
If your bounds to check are behind several layers of pointers, it is plausible that you will hit issues with memory latency and will suffer correspondingly. However, it is also plausible that the speculation and prediction machinery in the CPU will manage to hide this; it is very context-dependent. If you are taking references to the data inside, rather than dereferencing it at the same time as the bounds check, this risk is magnified.
If your bounds checks are in a tight arithmetic loop that doesn't saturate the core, you aren't likely to hurt throughput directly except by impeding other compiler optimisations. However, impeding other compiler optimisations can be arbitrarily bad, anywhere from no difference to preventing SIMD and causing a factor-10 slowdown.
If your bounds checks are in a tight arithmetic loop that does saturate the core, you take on the above risk and have a direct execution penalty of roughly half a cycle per bounds check.
If your code is large enough to stress the instruction cache, then you need to worry about the impact on code size. This is normally modest, but is particularly hard to measure the runtime impact of.
Peter Cordes adds some further points in comments. First, bounds checks imply loads and stores, so you're going to be running a mixed load which is most likely to bottleneck on issue/rename. Second, even predicted branches executed in parallel take resources from the predictor, which can cause other branches to predict worse.
This might seem intimidating, and it is. That is why it's important to measure and understand your performance at the level that is relevant for you and your code.
It is also the case that since Rust was "born" with bounds checking, it has produced means to reduce their cost, such as pervasive zero-cost references, iterators (which absorb, but don't actually remove, bounds checks), and an unusual set of nice utility functions. If you find yourself hitting a pathological case, Rust also offers unsafe escape hatches.
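Coming from C, the bounds-check-elimination point above is easy to picture: a check is just a compare plus a (normally well-predicted) branch, and when the loop bound is derived from the length, the optimizer can usually prove the check redundant. A rough C analogy, where abort() stands in for Rust's panic and all names are illustrative:
#include <stdlib.h>

/* What an indexed access with a bounds check amounts to. */
static double get_checked(const double *data, size_t len, size_t i) {
    if (i >= len)                 /* the check: one compare + predicted branch */
        abort();                  /* Rust would panic here */
    return data[i];
}

double sum(const double *data, size_t len) {
    double s = 0.0;
    for (size_t i = 0; i < len; i++)       /* i < len already implies the check passes, */
        s += get_checked(data, len, i);    /* so after inlining it can usually be removed */
    return s;
}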

Branch "anticipation" in modern CPUs

I was recently thinking about branch prediction in modern CPUs.
As far as I understand, branch prediction is necessary because, when executing instructions in a pipeline, we don't yet know the result of the condition at the point where the branch has to be taken.
Since modern out-of-order CPUs can execute instructions in any order, as long as the data dependencies between them are met, my question is: can CPUs reorder instructions in such a way that the branch outcome is already known by the time the CPU needs to take the branch, and thus "anticipate" the branch direction so it doesn't need to guess at all?
So can the CPU turn this:
do_some_work();
if (condition())   // evaluating here requires the CPU to guess the direction or stall
    do_this();
else
    do_that();
To this:
bool result = condition();
do_some_work();    // bunch of instructions that take longer than the pipeline length
if (result)        // value of result is known, thus the decision is always 100% correct
    do_this();
else
    do_that();
A particular and very common use case would be iterating over collections, where the exit condition is often loop-invariant (since we usually don't modify the collection while iterating over it).
My question is: can modern CPUs generally do this, and if so, which particular CPU cores are known to have this feature?
Keep in mind that branch prediction is done so early along the pipe that you still don't have the instruction decoded, and you can't resolve the data dependency because you don't know which register is used. You may be able to remember that somewhere, but that's not 100% reliable (since your storage capacity/time is limited), so that's pretty much what your normal branch predictor already does: speculate the target based on the instruction pointer alone.
However, pulling the condition evaluation earlier is useful, it's been done in the past, and is mostly a compiler technique, but may be enhanced with some HW support (e.g. - hoisting branch condition). The main performance impact of the branch misprediction is the delay in evaluation though, since the branch recovery itself these days is pretty short.
This means that you could mitigate most of the penalty with a compiler that only hoists the condition and calculates it earlier, without any HW modification. You're still paying the penalty of the flush in case you mispredicted the branch (and the odds are usually low with contemporary predictors), but you'll know the outcome immediately upon decoding the branch itself (since the data will be ready in advance), so the damage will be limited to only the very few instructions that made it down the pipe past that branch.
Being able to hoist the evaluation isn't simple, though. The compiler may be able to detect whether there are any direct data dependencies (with do_some_work() in your example), but in many cases there will be. Loop invariants are among the first things the compiler already moves today. In addition, some of the hardest-to-predict branches depend on some memory fetch, and you usually can't assume memory will stay the same (you can, with some special checks afterwards, but most common compilers don't do that). Either way, it's still a compiler technique, not a fundamental change in branch prediction.
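For illustration, this is the kind of transformation meant by hoisting a loop-invariant condition (compilers call it loop unswitching); the function and flag names here are made up:
#include <math.h>
#include <stddef.h>

/* use_abs is loop-invariant: testing it once outside the loops, instead of once
 * per iteration inside a single loop, leaves each loop body free of
 * data-dependent branches. */
void scale(double *a, size_t n, int use_abs) {
    if (use_abs) {
        for (size_t i = 0; i < n; i++)
            a[i] = fabs(a[i]) * 2.0;
    } else {
        for (size_t i = 0; i < n; i++)
            a[i] = a[i] * 2.0;
    }
}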
Branch prediction is done because the CPU's instruction fetcher needs to know which instructions to fetch after a branch instruction, and this is not known until after the branch executes.
If a processor has a 5 stage pipeline (most processors have more) like this:
Instruction fetch
Instruction decode
Register read
ALU execution
Register write back
the fetcher will stall for 3 cycles because the branch result won't be known until after the ALU execution cycle.
Hoisting the branch test condition does not address the latency from fetching a branch instruction to its execution.

How to deal with branch prediction when using a switch case in CPU emulation

I recently read the question here Why is it faster to process a sorted array than an unsorted array? and found the answer to be absolutely fascinating and it has completely changed my outlook on programming when dealing with branches that are based on Data.
I currently have a fairly basic but fully functioning interpreted Intel 8080 emulator written in C; the heart of the operation is a 256-entry switch-case table for handling each opcode. My initial thought was that this would obviously be the fastest way to work, as opcode encoding isn't consistent throughout the 8080 instruction set and decoding would add a lot of complexity, inconsistency and one-off cases. A switch-case table full of pre-processor macros is very neat and easy to maintain.
Unfortunately, after reading the aforementioned post it occurred to me that there's absolutely no way the branch predictor in my computer can predict the jumping for the switch case. Thus every time the switch-case is navigated the pipeline would have to be completely wiped, resulting in a several cycle delay in what should otherwise be an incredibly quick program (There's not even so much as multiplication in my code).
I'm sure most of you are thinking "Oh, the solution here is simple, move to dynamic recompilation". Yes, that does seem like it would cut out the majority of the switch-case and increase speed considerably. Unfortunately my primary interest is emulating older 8-bit and 16-bit era consoles (the Intel 8080 here is only an example, as it's my simplest piece of emulated code), where cycle-accurate timing down to the exact instruction is important, as the video and sound must be processed based on these exact timings.
When dealing with this level of accuracy, performance becomes an issue, even for older consoles (look at bsnes, for example). Is there any recourse, or is this simply a fact of life when dealing with processors with long pipelines?
On the contrary, switch statements are likely to be converted to jump tables, which means they perform possibly a few ifs (for range checking) and a single indirect jump. The ifs shouldn't cause a problem with branch prediction because it is unlikely you will have a bad opcode. The jump is not so friendly to the pipeline, but in the end it's only one for the whole switch statement.
I don't believe you can convert a long switch statement of opcodes into any other form that would result in better performance. This is, of course, assuming your compiler is smart enough to convert it to a jump table. If not, you can do so manually.
If in doubt, implement other methods and measure performance.
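For reference, doing it manually in C usually means an array of function pointers indexed by the opcode; everything below (the State type and the handler names) is hypothetical, not taken from the question's emulator:
#include <stdint.h>

typedef struct State State;                 /* emulator state, defined elsewhere */

static void op_nop(State *s)   { (void)s; }
static void op_add_b(State *s) { /* ... */ (void)s; }
static void op_sub_b(State *s) { /* ... */ (void)s; }

typedef void (*op_handler)(State *);

static const op_handler dispatch[256] = {
    [0x00] = op_nop,
    [0x80] = op_add_b,                      /* 8080 ADD B */
    [0x90] = op_sub_b,                      /* 8080 SUB B */
    /* ... fill in the remaining 253 opcodes ... */
};

void step(State *s, uint8_t opcode) {
    if (dispatch[opcode])                   /* entries left out above are NULL */
        dispatch[opcode](s);                /* one indirect call replaces the switch */
}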
Edit
First of all, make sure you don't confuse branch prediction and branch target prediction.
Branch prediction works only on branch statements: it decides whether the branch condition will fail or succeed. It has nothing to do with the jump statement.
Branch target prediction on the other hand tries to guess where the jump will end up in.
So, your statement "there's no way the branch predictor can predict the jump" should be "there's no way the branch target predictor can predict the jump".
In your particular case, I don't think you can actually avoid this. If you had a very small set of operations, perhaps you could come up with a formula that covers all your operations, like those made in logic circuits. However, with an instruction set as big as a CPU's, even if it were RISC, the cost of that computation is much higher than the penalty of a single jump.
As the cases in your 256-way switch statement are densely packed, the compiler will implement this as a jump table, so you're correct that you'll trigger a single branch mispredict every time you pass through this code (as the indirect jump won't display any kind of predictable behaviour). The penalty associated with this will be around 15 clock cycles on a modern CPU (Sandy Bridge), or maybe up to 25 on older microarchitectures that lack a micro-op cache. A good reference for this sort of thing is "Software optimisation resources" on agner.org. Page 43 of "Optimizing software in C++" is a good place to start.
http://www.agner.org/optimize/?e=0,34
The only way you could avoid this penalty is by ensuring that the same instructions are executed regardless of the value of the opcode. This can often be done by using conditional moves (which add a data dependency, so they are slower than a predictable branch) or by otherwise looking for symmetry in your code paths. Considering what you're trying to do, this is probably not going to be possible, and if it were, it would almost certainly add an overhead greater than the 15-25 clock cycles of the mispredict.
In summary, on a modern architecture there's not much you can do that'll be more efficient than a switch/case, and the cost of mispredicting a branch isn't as much as you might expect.
The indirect jump is probably the best thing to do for instruction decoding.
On older machines, like say the Intel P6 from 1997, the indirect jump would probably get a branch misprediction.
On modern machines, like say Intel Core i7, there is an indirect jump predictor that does a fairly good job of avoiding the branch misprediction.
But even on the older machines that do not have an indirect branch predictor, you can play a trick. This trick is (was), by the way, documented in the Intel Code Optimization Guide from way back in the Intel P6 days:
Instead of generating something that looks like
loop:
    load reg  := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp  reg2                            // jump to the handler address just loaded
label_instruction_00h_ADD: ...
    jmp loop
label_instruction_01h_SUB: ...
    jmp loop
...
generate the code as
loop:
    load reg  := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp  reg2
label_instruction_00h_ADD: ...
    load reg  := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp  reg2
label_instruction_01h_SUB: ...
    load reg  := next_instruction_bits   // or byte or word
    load reg2 := instruction_table[reg]
    jmp  reg2
...
i.e. replace the jump back to the top of the instruction fetch/decode/execute loop by a copy of the code from the top of the loop, at each place.
It turns out that this has much better branch prediction, even in the absence of an indirect predictor. More precisely, a conditional, single-target, PC-indexed BTB will do quite a lot better on this latter, threaded code than on the original with only a single copy of the indirect jump.
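In C, the same replication trick can be written with GCC/Clang's "labels as values" extension (computed goto). This toy two-opcode interpreter only sketches the shape of the technique, not real 8080 decoding:
#include <stdint.h>
#include <stddef.h>

/* Toy instruction set: opcode bit 0 selects NOP or ADD1. Each handler ends with
 * its own copy of the fetch/decode/dispatch, so each indirect jump gets its own
 * prediction history. Requires the GCC/Clang computed-goto extension. */
long run(const uint8_t *code, size_t n) {
    static void *table[2] = { &&do_nop, &&do_add };
    long acc = 0;
    size_t pc = 0;

    #define DISPATCH()                      \
        do {                                \
            if (pc >= n) return acc;        \
            goto *table[code[pc++] & 1];    \
        } while (0)

    DISPATCH();                             /* initial fetch/decode/dispatch */

do_nop:
    DISPATCH();                             /* this copy learns "what follows a NOP" */

do_add:
    acc += 1;
    DISPATCH();                             /* this copy learns "what follows an ADD" */
    return acc;                             /* not reached; keeps the compiler quiet */
}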
Most instruction sets have special patterns - e.g. on Intel x86, a compare instruction is nearly always followed by a branch.
Good luck and have fun!
(In case you care, the instruction decoders used by instruction set simulators in industry nearly always do a tree of N-way jumps, or the data-driven dual, navigate a tree of N-way tables, with each entry in the tree pointing to other nodes, or to a function to evaluate.
Oh, and perhaps I should mention: these tables, these switch statements or data structures, are generated by special purpose tools.
A tree of N-way jumps, because there are problems when the number of cases in the jump table gets very large - in the tool, mkIrecog (make instruction recognizer) that I wrote in the 1980s, I usually did jump tables up to 64K entries in size, i.e. jumping on 16 bits. The compilers of the time broke when the jump tables exceeded 16M in size (24 bits).
Data driven, i.e. a tree of nodes pointing to other nodes, because (a) on older machines indirect jumps may not be predicted well, and (b) it turns out that much of the time there is common code between instructions - instead of taking a branch misprediction when jumping to the case for each instruction, then executing the common code, then switching again and getting a second mispredict, you execute the common code with slightly different parameters (like how many bits of the instruction stream you consume, and where the next set of bits to branch on is).
I was very aggressive in mkIrecog, as I say allowing up to 32 bits to be used in a switch, although practical limitations nearly always stopped me at 16-24 bits. I remember that I often saw the first decode as a 16 or 18 bit switch (64K-256K entries), and all other decodes were much smaller, no bigger than 10 bits.
Hmm: I posted mkIrecog to Usenet back circa 1990. ftp://ftp.lf.net/pub/unix/programming/misc/mkIrecog.tar.gz
You may be able to see the tables used, if you care.
(Be kind: I was young then. I can't remember if this was Pascal or C. I have since rewritten it many times - although I have not yet rewritten it to use C++ bit vectors.)
Most of the other guys I know who do this sort of thing do things a byte at a time - i.e. an 8 bit, 256 way, branch or table lookup.)
I thought I'd add something since no one mentioned it.
Granted, the indirect jump is likely to be the best option.
However, should you go with the N-compare way, there are two things that come to my mind:
First, instead of doing N equality compares, you could do log(N) inequality compares, testing your instructions based on their numerical opcode by dichotomy (or testing the number bit by bit if the value space is nearly full). This is a bit like a hash table: you implement a static tree to find the final element.
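As a sketch of that first idea (the grouping below roughly follows the 8080's opcode layout, but treat the ranges as illustrative):
#include <stdint.h>

void dispatch(uint8_t op) {
    if (op < 0x80) {
        if (op < 0x40) { /* 0x00-0x3F: misc / immediates / inc-dec group */ }
        else           { /* 0x40-0x7F: MOV group                        */ }
    } else {
        if (op < 0xC0) { /* 0x80-0xBF: ALU group                        */ }
        else           { /* 0xC0-0xFF: jumps, calls, stack, etc.        */ }
    }
}
Each comparison is an ordinary conditional branch the predictor can learn, as opposed to the single, hard-to-predict indirect jump of a jump table.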
Second, you could run an analysis on the binary code you want to execute.
You could even do that per binary, before execution, and runtime-patch your emulator.
This analysis would build a histogram representing the frequency of instructions, and then you would organize your tests so that the most frequent instructions get predicted correctly.
But I can't see this being faster than the typical 15-cycle mispredict penalty, unless 99% of your instructions are MOV and you put an equality test for the MOV opcode before the other tests.

Design code to fit in CPU Cache?

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.
At least with a typical desktop CPU, you can't really specify much about cache usage directly, but you can still try to write cache-friendly code. On the code side, this often means that loop unrolling (to take one obvious example) is rarely useful: it expands the code, and a modern CPU typically minimizes the overhead of looping anyway. You can generally do more on the data side: improve locality of reference, avoid cache conflicts (e.g. two frequently-used pieces of data that end up competing for the same part of the cache while other parts remain unused), and protect against false sharing.
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache is normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input into progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
This also has implications for how you write code. If you have a loop like:
for i = 0 to whatever
    step1(data);
    step2(data);
    step3(data);
end for
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for i = 0 to whatever
    step1(data);
    step2(data);
end for

for i = 0 to whatever
    step3(data);
end for
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks that tend to do really small amounts of processing on really large amounts of data, that you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less--and overflowing the cache leading to a serious performance loss isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory on tiny processors that have been (mostly) obsolete for decades work out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache anymore, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.
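A common remedy is to pad or align per-thread data so each piece sits on its own cache line. A minimal C11 sketch, assuming 64-byte lines:
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64

struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;   /* each counter occupies its own line */
};

/* One counter per worker thread: updates no longer ping-pong a shared line. */
struct padded_counter per_thread_hits[8];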
Here's a link to a really good paper on caches/memory optimization by Christer Ericsson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.
A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessey covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.
UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click @ Azul
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit, here's a particularly successful example of how I changed some code to make better use of on-chip caches. This was done some time ago on a 486 CPU and later migrated to a 1st-generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values, but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline (hence the long tails, which could serve as subscripts for the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints all had the same values before 128 ahead of zero, and 128 after zero, so except for the middle 256 values, they all pointed to either the first or last value in the double vector.
This shrunk the storage requirement to 2K for the doubles and 1,250 bytes for the 8-bit subscripts, taking 10,000 bytes down to 3,298. Since the program spent 90% or more of its time in this inner loop, the 2 vectors never got pushed out of the 8K data cache. The program immediately doubled its performance. This code got hit ~100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
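A rough sketch of the layout described above (the sizes come from the answer; the exact handling of the tails is my reconstruction):
#include <stdint.h>

#define RAW_RANGE 1250        /* subscripts the Monte Carlo code can produce */
#define CURVE_LEN  256        /* shortened table holding the distinct values */

static double  curve[CURVE_LEN];   /* ~2K of doubles                         */
static uint8_t remap[RAW_RANGE];   /* 1,250 one-byte subscripts into curve[] */

static inline double lookup(int raw)   /* raw in [0, RAW_RANGE) */
{
    /* Tail entries of remap[] all point at the first or last curve value,
     * so no two-sided if() is needed in the hot loop. */
    return curve[remap[raw]];
}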
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache(s). This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. I.e.: you don't want instructions hogging the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is to use a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. Masking off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, and so on, requires the CPU to use masks on the host integers the bits are packed into, which shrinks the data, effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.
Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone: the introduction of dedicated bit-manipulation instructions in Intel's new Haswell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond, adding the sum of the most current slice's resource requests and subtracting out the 1,000th time slice's requests, comprising resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this, and studying the code, a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring (really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. I.e.: I only needed a conditional test for 1 of the 3 pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage; however, I was able to relax this constraint - satisfying the user's request to handle higher resource counts per user per millisecond - to use unsigned shorts and STILL double performance, because even with 3 adjacent vectors of 512 unsigned shorts, the 32K L1 data cache could easily hold the required 3,720 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped was one of the 3 separated by any significant "step" in the 8MB L3 cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.
Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.
If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.
