Optimizing ARM cache usage for different arrays - arm

I want to port a small piece of code to an ARM Cortex-A8 processor. Both the L1 and L2 caches are very limited. There are 3 arrays in my program. Two of them are accessed sequentially (Array A: 6MB, Array B: 3MB), and the access pattern for the third (Array C: 3MB) is unpredictable. The calculations are not very demanding, but accesses to array C cause a huge number of cache misses. One solution I thought of would be to allocate more L2 cache space to array C and less to arrays A and B, but I have not been able to find any way to achieve this. I went through the ARM preload engine documentation but could not find anything useful.

It would be a good idea to split the cache and allocate each array in a different part of it.
Unfortunately that is not possible. The caches of the CortexA8 just are not that flexible. The good old StrongArm had a secondary cache for exactly this splitting purpose, but it's not available anymore. We have L1 and L2 caches instead (overall a good change imho.)
However, there is a thing you can do:
The NEON SIMD unit of the CortexA8 lags behind the general purpose processing unit by around 10 processor cycles. With clever programming you can issue cache prefetches from the general purpose unit but do the accesses via NEON. The delay between the two pipelines gives the cache a bit of time to do the prefetches, so your average cache miss time will be lower.
The drawback is that you must never move the result of a calculation back from NEON to the ARM unit. Since NEON lags behind, this causes a full CPU pipeline flush, which is almost as costly as a cache miss, if not more so.
The difference in performance can be significant. Out of the blue I would expect something between 20% and 30% of speed improvement.
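To make the "issue the prefetch early, do the access later" idea concrete, here is a minimal sketch in plain C. It only shows the prefetch-ahead half, not the NEON part, and it assumes that the indices into array C are available a few iterations ahead (for example because they are derived from the sequential arrays). The names sum_indirect, idx and LOOKAHEAD are made up for illustration, and the lookahead value is a guess that would need tuning on the real hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lookahead distance; tune it by measurement on the target. */
#define LOOKAHEAD 8

/* Sums c[idx[i]] for i in [0, n). The index stream idx is read sequentially,
 * so the address of a future, unpredictable access to c is known early
 * enough to request the cache line ahead of time. */
uint32_t sum_indirect(const uint8_t *c, const uint32_t *idx, size_t n)
{
    assert(c != NULL && idx != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + LOOKAHEAD < n) {
            /* GCC/Clang builtin; on ARMv7 it typically compiles to PLD. */
            __builtin_prefetch(&c[idx[i + LOOKAHEAD]], 0 /* read */, 1);
        }
        sum += c[idx[i]];
    }
    return sum;
}
```

If the index really cannot be known ahead of time, this pattern does not apply and the misses on array C have to be attacked some other way.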

From what I could find via Google, it looks like ARMv7 (which is the version of the ISA that Cortex A8 supports) has cache-flush capability, though I couldn't find a clear reference on how to use it -- perhaps you can do better if you spend more time on it than the minute or two I spent typing "ARM cache flush" into a search box and reading the results.
In any case, you should be able to achieve an approximation of what you want by periodically issuing "flush" instructions to flush out the parts of A and B that you know you no longer need.


How to use software prefetch systematically?

After reading the accepted answer in When should we use prefetch? and examples from Prefetching Examples?, I still have a lot of issues with understanding when to actually use prefetch. While those answers provide an example where prefetch is helpful, they do not explain how to discover it in real programs. It looks like random guessing.
In particular, I am interested in the C implementations for intel x86 (prefetchnta, prefetcht2, prefetcht1, prefetcht0, prefetchw) that are accessible through GCC's __builtin_prefetch intrinsic. I would like to know:
How can I see that software prefetch can help for my specific program? I imagine that I can collect CPU profiling metrics (e.g. number of cache misses) with Intel Vtune or Linux utility perf. In this case what metrics (or relation between them) indicate the opportunity to improve performance with software prefetching?
How can I locate the loads that suffer from cache misses the most?
How to see the cache level where misses happen to decide which prefetch(0,1,2) to use?
Assuming I found a particular load that suffers from the miss in a specific cache level, where should I place prefetch? As an example, assume that the next loop suffers from cache misses
for (int i = 0; i < n; i++) {
    // some code
    double x = a[i];
    // some code
}
Should I place the prefetch before or after the load a[i]? How far ahead should it point (a[i+m])? Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries, or will it be almost free, like a nop, if the data is already in cache? Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
How can I see that software prefetch can help for my specific program?
You can check the proportion of cache misses. perf or VTune can report it using hardware performance counters. You can get the list of available events with perf list, for example. The list depends on the target processor architecture, but there are some generic events, for example L1-dcache-load-misses, LLC-load-misses and LLC-store-misses. The number of cache misses is not very useful unless you also have the number of loads/stores, for which there are generic counters like L1-dcache-loads, LLC-loads or LLC-stores. AFAIK, there are no generic counters for the L2 (at least on Intel processors), so you need to use processor-specific hardware counters (for example l2_rqsts.miss on Intel Skylake-like processors). To get the overall statistics, you can use perf stat -e an_hardware_counter,another_one your_program. Good documentation can be found here.
When the proportion of misses is high, you should try to optimize the code, but this is only a hint. Your application can have a lot of cache hits overall and still suffer many cache misses in a critical part or phase, so that the misses that matter are lost among all the other accesses. This is especially true for L1 cache references, which are far more numerous in scalar code than in SIMD code. One solution is to profile only a specific portion of your application and use your knowledge of it to investigate in the right direction. Performance counters are not really a tool that automatically finds issues in your program, but a tool that helps you validate or disprove hypotheses, or that gives hints about what is happening. They give you evidence to solve a mysterious case, but it is up to you, the detective, to do all the work.
How can I locate the loads that suffer from cache misses the most?
Some hardware performance counters are "precise", meaning that the instruction that generated the event can be located. This is very useful since you can tell which instructions are responsible for the most cache misses (though it is not always precise in practice). You can use perf record + perf report to get this information (see the previous tutorial for more information).
Note that there are many possible causes of cache misses, and only a few of them can be solved by software prefetching.
How to see the cache level where misses happen to decide which prefetch(0,1,2) to use?
This is often difficult to choose in practice and depends heavily on your application. In theory, the number is a hint telling the processor the level of locality of the target cache line (e.g. whether it should be fetched into the L1, L2 or L3 cache). For example, if you know that the data will be read and reused soon, it is a good idea to put it in the L1. However, if the L1 is in use and you do not want to pollute it with data used only once (or rarely), it is better to fetch the data into the L2 or L3 instead. In practice, it is a bit complex since the behavior may not be the same from one architecture to another... See What are _mm_prefetch() locality hints? for more information.
An example of usage is in this question. Software prefetching was used to avoid a cache thrashing issue with some specific strides. This is a pathological case where the hardware prefetcher is not very useful.
Assuming I found a particular load that suffers from the miss in a specific cache level, where should I place prefetch?
This is clearly the trickiest part. You should prefetch the cache lines sufficiently early for the latency to be significantly reduced; otherwise the instruction is useless and can actually be detrimental. Indeed, the instruction takes space in the program, needs to be decoded, and uses load ports that could otherwise execute other (more critical) load instructions. However, if the prefetch is issued too early, the cache line can be evicted before it is used and needs to be reloaded...
The usual solution is to write code like this:
for (int i = 0; i < n; i++) {
    // some code
    const size_t magic_distance_guess = 200;
    __builtin_prefetch(&a[i + magic_distance_guess]);
    double x = a[i];
    // some code
}
Here magic_distance_guess is a value generally set based on benchmarks (or on a very deep understanding of the target platform, though practice often shows that even highly skilled developers fail to find the best value).
The thing is that the latency depends heavily on where the data comes from and on the target platform. In most cases, developers cannot know exactly when to prefetch unless they work on a single, known target platform. This makes software prefetching tricky to use and often detrimental when the target platform changes (one has to consider the maintainability of the code and the overhead of the instruction). Not to mention that built-ins are compiler-dependent, prefetching intrinsics are architecture-dependent, and there is no standard portable way to use software prefetching.
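Since the distance has to be found by benchmarking, it helps to make it a parameter rather than a hard-coded constant. The following is only a sketch of that idea for GCC/Clang; the function name sum_prefetch and its distance parameter are invented here, and a distance of 0 gives the plain loop to compare against.

```c
#include <assert.h>
#include <stddef.h>

/* Sums a[0..n) while prefetching `distance` elements ahead.
 * `distance` is the value to tune by benchmarking; 0 disables prefetching. */
double sum_prefetch(const double *a, size_t n, size_t distance)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Stay inside the array so the address computation is well defined. */
        if (distance != 0 && i + distance < n) {
            __builtin_prefetch(&a[i + distance], 0 /* read */, 3 /* keep in all levels */);
        }
        sum += a[i];
    }
    return sum;
}
```

Timing this for a handful of distances (say 0, 16, 64, 200) on the real target is the "based on benchmarks" step. For a simple sequential loop like this one, do not be surprised if 0 wins.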
Do I need to worry about unrolling the loop to make sure that I am prefetching only on cache line boundaries or it will be almost free like a nop if data is already in cache?
Yes, prefetching instructions are not free and so it is better to use only 1 instruction per cache line (as other prefetching instruction on the same cache line will be useless).
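One way to get a single prefetch per cache line is to process a whole line per outer iteration. The sketch below assumes 64-byte cache lines (8 doubles), which is common on x86-64 but is still an assumption, as is the helper name sum_prefetch_per_line. If the array does not start on a line boundary, the prefetches are still one per 64 bytes, just not aligned to line starts.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* Sums a[0..n), issuing one prefetch per line, `lines_ahead` lines ahead. */
double sum_prefetch_per_line(const double *a, size_t n, size_t lines_ahead)
{
    assert(a != NULL || n == 0);
    const size_t ahead = lines_ahead * DOUBLES_PER_LINE;
    double sum = 0.0;
    size_t i = 0;
    /* Main loop: one prefetch, then one full line of loads. */
    for (; i + DOUBLES_PER_LINE <= n; i += DOUBLES_PER_LINE) {
        if (i + ahead < n) {
            __builtin_prefetch(&a[i + ahead], 0, 3);
        }
        for (size_t k = 0; k < DOUBLES_PER_LINE; k++) {
            sum += a[i + k];
        }
    }
    /* Tail: leftover elements that do not fill a whole line. */
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```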
Is it worth using multiple __builtin_prefetch calls in a row to prefetch multiple cache lines at once?
This depends heavily on the target platform. Modern mainstream x86-64 processors execute instructions out of order and in parallel, and they analyze a pretty large window of instructions. They tend to execute loads as early as possible to avoid misses, and they are often very good at this job.
In your example loop, I expect the hardware prefetcher to do a very good job, and software prefetching should be slower on a (relatively recent) mainstream processor.
Software prefetching was useful a decade ago, when hardware prefetchers were not very smart, but they tend to be very good nowadays. Additionally, it is often better to guide the hardware prefetchers than to use software prefetching instructions, since the former has a lower overhead. This is why software prefetching is discouraged (e.g. by Intel and most developers) unless you really know what you are doing.
How to use software prefetch systematically?
The quick answer is: don't.
As you correctly analyzed, prefetching is a tricky and advanced optimisation technique that is not portable and rarely useful.
You can use profiling to determine what sections of code form a bottleneck and use specialized tools such as valgrind to try and identify cache misses that could potentially be avoided using compiler builtins.
Don't expect too much from this, but do profile the code to concentrate your optimizing efforts where it can be useful.
Remember also that a better algorithm can beat an optimized implementation of a less efficient one for large datasets.

Benchmarking microcontrollers

Currently I am working on setting up a benchmark between microcontrollers (based on PowerPC). I would like to know if anyone can provide documentation showing in detail which factors are most important to consider for benchmarking.
In other words, I am looking for documentation with detailed information about the factors that should be considered to improve the performance of
Memory banks
Plus, if someone could provide algorithms, that would be very helpful.
There is only one useful way, and that is to write your application for both and time your application. Benchmarks are for the most part bogus: there are too many factors, and it is quite trivial to craft a benchmark that takes advantage of the differences, or even takes advantage of the common features, in a way that makes two things look different.
I perform this stunt on a regular basis, most recently with this code:
test:
    subs r0,r0,#1
    bne test
    bx lr
This was run on a Raspberry Pi (bare metal), and on the same Raspberry Pi: I was not comparing two boards, just comparing it to itself. It is clearly assembly, so it does not even involve the compiler features/tricks that you can encode in a benchmark intentionally or accidentally. Two of those three instructions matter for benchmarking purposes; have the loop run many tens of thousands of times (I think I used 0x100000). The punchline is that those two instructions in a loop ran as fast as 93662 timer ticks and as slow as 4063837 timer ticks for 0x10000 loops. Certainly the I-cache and branch prediction were turned on and off for various tests. But even with both branch prediction and the I-cache on, these two instructions vary in speed depending on where they lie within the fetch line and the cache line.
A microcontroller makes this considerably worse, depending on what you are comparing. Some have flashes that can use the same wait state for a wide range of clock speeds; others are speed limited, and for every N MHz you have to add another wait state. So where you set your clock affects performance across that range, and definitely just below and just above the boundary where you add a wait state. For example, compare 24 MHz minus a smidge with 24 MHz plus an extra wait state: if that takes you from 2 to 3 wait states, each fetch now waits 50% longer. At 36 MHz minus a smidge you may still be at 3 wait states, but 3 wait states at 36 MHz minus a smidge is faster than 3 wait states at 24 MHz. If you run the same code from SRAM instead of flash on those platforms, there usually isn't a wait state issue: the SRAM can usually match the CPU clock, so that code at any speed may be faster than the same code run from flash.
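The wait-state effect is easy to put into numbers with a deliberately simplified model. The sketch below assumes every flash fetch costs 1 + wait_states CPU cycles and ignores prefetch buffers and caches that real parts often have; the function name flash_fetch_rate is made up for this illustration.

```c
#include <assert.h>

/* Simplified model (an assumption, not a datasheet formula): every flash
 * fetch costs 1 + wait_states CPU cycles, so the fetch rate in fetches per
 * second is clock_hz / (1 + wait_states). */
double flash_fetch_rate(double clock_hz, unsigned wait_states)
{
    assert(clock_hz > 0.0);
    return clock_hz / (1.0 + (double)wait_states);
}
```

Under this model, 24 MHz with 2 wait states gives 8 million fetches per second, 24 MHz with 3 wait states drops to 6 million, and 36 MHz with 3 wait states reaches 9 million. A higher clock can therefore fetch more slowly right after a wait-state boundary than a slightly lower clock just before it.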
If you are comparing two microcontrollers from the same vendor and family then it is usually pointless: the internals are the same, and they usually just vary by how many flash banks, how many SRAM banks, how many UARTs, how many timers, how many pins, etc.
One of my points is that if you don't know the nuances of the overall architecture, you can possibly make the same code you are running now on the same board a few percent to tens of times faster simply by understanding how things work: enabling features you didn't know were there, or properly aligning the code that is exercised often (simply re-arranging your functions within a C file can and will affect performance). Adding one or more nops in the bootstrap to change the alignment of the whole program can and will change performance.
Then you get into compiler differences and compiler options; you can play with those and also get some to several to dozens of times improvement (or loss).
So at the end of the day the only thing that matters is: I have an application, this is the final binary, how fast does it run on A? Then I ported that application, the final binary for B is done, how fast does it run there? Everything else can be manipulated, and the results can't be trusted.
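"Time your application" can be as simple as the harness sketched below, which runs the same workload several times and keeps the fastest and slowest run, since the spread is often as informative as the average. It assumes a hosted C environment where clock() works; on a bare-metal microcontroller a hardware timer would take its place. The names time_runs, timing_t and sample_work are invented for this example, and sample_work merely stands in for the real application.

```c
#include <assert.h>
#include <time.h>

typedef struct {
    double min_s;   /* fastest run, in seconds */
    double max_s;   /* slowest run, in seconds */
} timing_t;

/* Hypothetical stand-in for "your application". */
volatile unsigned long sink;
void sample_work(void)
{
    for (unsigned long i = 0; i < 100000UL; i++) {
        sink += i;
    }
}

/* Runs `work` `runs` times and records the fastest and slowest run. */
timing_t time_runs(void (*work)(void), int runs)
{
    assert(work != NULL && runs > 0);
    timing_t t = { 0.0, 0.0 };
    for (int r = 0; r < runs; r++) {
        clock_t start = clock();
        work();
        double s = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (r == 0 || s < t.min_s) t.min_s = s;
        if (r == 0 || s > t.max_s) t.max_s = s;
    }
    return t;
}
```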

Is a pointer indirection more costly than a conditional?

Is a pointer indirection (to fetch a value) more costly than a conditional?
I've observed that most decent compilers can precompute a pointer indirection to varying degrees--possibly removing most branching instructions--but what I'm interested in is whether the cost of an indirection is greater than the cost of a branch point in the generated code.
I would expect that if the data referenced by the pointer is not in a cache at runtime that a cache flush might occur, but I don't have any data to back that.
Does anyone have solid data (or a justifiable opinion) on the matter?
EDIT: Several posters noted that there is no "general case" on the cost of branching: it varies wildly from chip to chip.
If you happen to know of a notable case where branching would be cheaper (with or without branch prediction) than an in-cache indirection, please mention it.
This is very much dependent on the circumstances.
1 How often is the data in cache (L1, L2, L3), and how often must it be fetched all the way from RAM?
A fetch from RAM will take around 10-40ns. Of course, that will fill a whole cache-line in little more than that, so if you then use the next few bytes as well, it will definitely not "hurt as bad".
2 What processor is it?
Older Intel Pentium4 were famous for their long pipeline stages, and would take 25-30 clockcycles (~15ns at 2GHz) to "recover" from a branch that was mispredicted.
3 How "predictable" is the condition?
Branch prediction really helps in modern processors, and they can cope quite well with "unpredictable" branches too, but it does hurt a little bit.
4 How "busy" and "dirty" is the cache?
If you have to throw out some dirty data to fill the cache-line, it will take another 15-50ns on top of the "fetch the data in" time.
The indirection itself will be a fast instruction, but of course, if the next instruction uses the data immediately after, you may not be able to execute that instruction immediately - even if the data is in L1 cache.
On a good day (well predicted, target in cache, wind in the right direction, etc), a branch, on the other hand, takes 3-7 cycles.
And finally, of course, the compiler USUALLY knows quite well what works best... ;)
In summary, it's hard to say for sure, and the only way to tell what is better IN YOUR case would be to benchmark alternative solutions. I would think that an indirect memory access is faster than a jump, but without seeing what code your source compiles to, it's quite hard to say.
It would really depend on your platform. There is no one right answer without looking at the innards of the target CPU. My advice would be to measure it both ways in a test app to see if there is even a noticeable difference.
My gut instinct would be that on a modern CPU, branching through a function pointer and conditional branching both rely on the accuracy of the branch predictor, so I'd expect similar performance from the two techniques if the predictor is presented with similar workloads. (i.e. if it always ends up branching the same way, expect it to be fast; if it's hard to predict, expect it to hurt.) But the only way to know for sure is to run a real test on your target platform.
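For "measure it both ways", a test app needs two variants that compute the same thing, one through a conditional and one through an indirection. The pair below is only an illustrative sketch (the function names and the weights table are made up). Note that an optimizing compiler may turn the conditional into a branch-free select, so check the generated assembly before drawing conclusions from timings.

```c
#include <assert.h>
#include <stddef.h>

/* Variant 1: a conditional decides whether the element counts double. */
long weighted_sum_branch(const int *v, size_t n, int threshold)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] >= threshold) {
            sum += 2L * v[i];
        } else {
            sum += v[i];
        }
    }
    return sum;
}

/* Variant 2: the comparison result (0 or 1) indexes a table through a
 * pointer, so the weight is fetched by indirection instead of branching. */
long weighted_sum_indirect(const int *v, size_t n, int threshold,
                           const long *weights)
{
    assert(weights != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += weights[v[i] >= threshold] * v[i];
    }
    return sum;
}
```

Calling both with the same data, once with sorted (predictable) and once with shuffled (unpredictable) input, and with weights = {1, 2}, gives the comparison the answers above are asking for.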
It depends from processor to processor, but depending on the set of data you're working with, a pipeline flush caused by a mispredicted branch (or badly ordered instructions in some cases) can be more damaging to the speed than a simple cache miss.
In the PowerPC case, for instance, branches not taken (but predicted to be taken) cost about 22 cycles (the time taken to re-fill the pipeline), while an L1 cache miss may cost 600 or so memory cycles. However, if you're going to access contiguous data, it may be better to not branch and let the processor cache-miss your data at the cost of 3 cycles (branches predicted to be taken and taken) for every set of data you're processing.
It all boils down to: test it yourself. The answer is not definitive for all problems.
Since the processor would have to predict the conditional answer in order to plan which instruction has more chances of having to be executed, I would say that the actual cost of the instructions is not important.
Conditional instructions are bad efficiency wise because they make the process flow unpredictable.

What is the limit of optimization using SIMD?

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
The best optimization occurs in rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.
For example, if the vector array is a list of n points that all lie on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
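A small sketch of the end-point trick, under two assumptions that are not in the answer above: the points are evenly spaced along the segment, and the transform is affine (so it maps straight lines to straight lines and preserves the interpolation parameter). The type vec2 and the functions transform and transform_segment are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } vec2;

/* Hypothetical affine transform: scale, then translate. */
vec2 transform(vec2 p)
{
    vec2 q = { 2.0f * p.x + 1.0f, 0.5f * p.y - 3.0f };
    return q;
}

/* Writes to out[0..n) the transformed versions of n evenly spaced points on
 * the segment a..b. Only the two end points go through transform(); the
 * rest are linearly interpolated between the transformed end points. */
void transform_segment(vec2 a, vec2 b, size_t n, vec2 *out)
{
    assert(n >= 2 && out != NULL);
    const vec2 ta = transform(a);
    const vec2 tb = transform(b);
    for (size_t i = 0; i < n; i++) {
        const float t = (float)i / (float)(n - 1);
        out[i].x = ta.x + t * (tb.x - ta.x);
        out[i].y = ta.y + t * (tb.y - ta.y);
    }
}
```

This replaces n calls to transform with 2 calls plus a cheap interpolation per point. It does not work for non-affine transforms such as a perspective projection.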
It CAN give a better speedup than 4x over straight floating point, as the SIMD instructions could be less exact (not so much as to cause too many problems, though) and so take fewer cycles to execute. It really depends.
The best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than 4x improvements. You may find out it can't. We can't say, though, without knowing more about the algorithm you are optimising and which CPU you are targeting.
On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...
This is entirely possible.
You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect.
Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.
Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the level of parallelism (16 pixels at a time).
It depends on the architecture. For the moment I assume the x86 architecture (i.e. SSE).
You can get a factor of four on tight loops easily. Just replace your existing math with SSE instructions and you're done.
You can even get a little more than that, because with SSE you do the math in registers which are usually not used by the compiler. This frees up the general purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.
And then there is the option to hint to the memory controller how you want to access the memory, e.g. whether you want to store data in a way that bypasses the cache or not. For bandwidth-hungry algorithms that may give you some extra speed on top of that.

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a lack-of-understanding-on-my-part barrier. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
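float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
int i, j;
for(i=0;i<32768-1;i++) {
  xref=px[i]; yref=py[i]; zref=pz[i];
  for(j=0;j<32768-1;j++) {
    deltax=xref-px[j];
    deltay=yref-py[j];
    deltaz=zref-pz[j];
  } }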
What maximum theoretical speedup would I be able to see by going to SSE instructions in a situation where I have complete control over the code (assembly, intrinsics, whatever) but no control over the runtime environment other than the architecture (i.e. it's a multi-user environment, so I can't do anything about how the OS kernel assigns time to my particular process)?
Right now I'm seeing a speedup of 3x with my code, when I would have thought that SSE would give me much more vector depth than a 3x speedup indicates (presumably the 3x speedup tells me I have a 4x maximum theoretical throughput). (I've tried things such as making deltax/deltay/deltaz arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only a 3x speedup.) I'm using the Intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics, obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can compute at most 4 values per cycle.
Most CPUs can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
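With the separate px/py/pz arrays from the question, "4 per iteration" can mean four consecutive j values of one component in a single SSE register. The sketch below shows that for the x component only; the function name delta_x4 and its output array dx are made up for illustration. It uses real SSE intrinsics from xmmintrin.h when the compiler defines __SSE__, and a scalar fallback otherwise so that it still builds on other targets.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Computes dx[j] = xref - px[j] for j in [0, n); n must be a multiple of 4. */
void delta_x4(float xref, const float *px, float *dx, size_t n)
{
    assert(n % 4 == 0);
#if defined(__SSE__)
    const __m128 ref = _mm_set1_ps(xref);            /* xref in all 4 lanes */
    for (size_t j = 0; j < n; j += 4) {
        const __m128 p = _mm_loadu_ps(&px[j]);       /* px[j..j+3] */
        _mm_storeu_ps(&dx[j], _mm_sub_ps(ref, p));   /* 4 subtractions at once */
    }
#else
    /* Scalar fallback for targets without SSE: same result, unrolled by 4. */
    for (size_t j = 0; j < n; j += 4) {
        dx[j]     = xref - px[j];
        dx[j + 1] = xref - px[j + 1];
        dx[j + 2] = xref - px[j + 2];
        dx[j + 3] = xref - px[j + 3];
    }
#endif
}
```

The same pattern applies to the y and z components. A vectorizing compiler will often generate equivalent code from the plain loop, which may be why the measured speedup is already close to the 4x bound.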
Consider: How wide is a float? How wide is the SSEx instruction? The ratio should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havoc with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.
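Here is a rough sketch of what tiling the inner loop could look like. To have something observable it accumulates the squared distance over all pairs, which is an assumption about what the real loop does with the deltas; the function name pair_sum_tiled and the TILE value are likewise made up. TILE is chosen so that one tile of px, py and pz (3 x 4 KB) fits comfortably in a typical 32 KB L1 data cache, and should be tuned on the actual machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile size: 1024 floats = 4 KB per array. */
#define TILE 1024

/* Sum over all ordered (i, j) pairs of the squared distance between
 * point i and point j, visiting j in cache-sized tiles. */
double pair_sum_tiled(const float *px, const float *py, const float *pz,
                      size_t n)
{
    assert(px != NULL && py != NULL && pz != NULL);
    double sum = 0.0;
    for (size_t jj = 0; jj < n; jj += TILE) {
        const size_t jend = (jj + TILE < n) ? jj + TILE : n;
        for (size_t i = 0; i < n; i++) {
            const float xref = px[i], yref = py[i], zref = pz[i];
            /* This j range stays in L1 while every i is processed against it. */
            for (size_t j = jj; j < jend; j++) {
                const float dx = xref - px[j];
                const float dy = yref - py[j];
                const float dz = zref - pz[j];
                sum += (double)dx * dx + (double)dy * dy + (double)dz * dz;
            }
        }
    }
    return sum;
}
```

The arithmetic is the same as in the untiled double loop; only the order in which the (i, j) pairs are visited changes, so that each block of j values is reused from L1 for all i before moving on.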
