Last used cache line versus different cache lines - c

Let's assume cache lines are 64 bytes wide and I have two arrays a and b which fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question: is it now faster to read a[1] than to read b[0]? In other words, is it faster to read from the last used cache line?
Does the set matter? Let's now assume that I have a 32 KB L1 data cache which is 4-way. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
Another way to ask my question (which is what I really care about) is in regards to reading a matrix.
In other words which one of these two code options will be more efficient assuming matrix M fits in the L1 cache and is 64 byte aligned and is already in the L1 cache.
float M[16][16]; //64 byte aligned
Version 1:
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[i][j];
    }
}
Version 2:
for(int i=0; i<16; i++) {
    for(int j=0; j<16; j++) {
        x += M[j][i];
    }
}
Edit: To make this clear with regard to SSE/AVX, let's assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cache miss)?
Edit: I'm mostly interested in all processors since Intel Core 2 and Nehalem but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.

With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[0] not recently loaded but still in L1 cache, there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing M[0][0] and so on are evicted from cache.
When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).
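The set mapping in the scenario above can be sketched in a few lines of C. This is only an illustration, assuming 64-byte lines and a set index taken from the address bits just above the line offset; the names cache_set_index and same_set_distance are made up for this example and do not come from any processor documentation.

```c
#include <stdint.h>

/* Hypothetical helper: which set an address maps to, assuming the set index
   is simply (address / line size) modulo the number of sets. */
unsigned cache_set_index(uintptr_t addr, unsigned line_bytes, unsigned num_sets)
{
    return (unsigned)((addr / line_bytes) % num_sets);
}

/* Distance in bytes after which addresses map to the same set again.
   This equals num_sets * line_bytes, i.e. cache size divided by ways. */
unsigned same_set_distance(unsigned cache_bytes, unsigned ways)
{
    return cache_bytes / ways;
}
```

Under these assumptions a 32 KB 8-way cache has 64 sets and a same-set distance of 4096 bytes, which is the row length of float M[1024][1024], while the 32 KB 4-way cache assumed in the question has 128 sets and a same-set distance of 8192 bytes.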

Fetching a[0] and then either a[1] or b[0] should amount to two cache accesses that hit the L1 in either case. You didn't say which uArch you're using, but I'm not familiar with any mechanism that does further "caching" of the full cache line above the L1 (anywhere in the memory unit), and I don't think such a mechanism would be feasible (at least not for any reasonable price).
Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line - your HW would have to not only keep the full cache line somewhere in the memory unit in case it's going to be accessed again (not sure how common that case is, so this feature is probably not worth the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for wb memory). In fact, it could even be a store in the same thread context, and you'll have to guard against that (since most common x86 CPUs today perform loads out of order). If you don't maintain both of these (and probably other safeguards too), you break coherency; if you do, you've created a monster of logic that does the same as your L1 already does, just to save a meager 1-2 cycles of access.
However, even though both options would require the same number of cache accesses, there may be other considerations affecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc., all of which depend on your exact machine implementation.
If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same accesses but will be lighter on execution BW. I think that any decent compiler should be able to unroll your loops at this size, and combine the consecutive accesses into a single vector, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would complicate the job for the compiler).
Edit
Since you're also asking about fitting the matrix in the L2 - that simplifies the question - in that case using the same line(s) multiple times as in option 1 is better, as it allows you to hit the L1, while the alternative is to constantly fetch from the L2, which has higher latency and lower bandwidth. This is the basic principle behind loop tiling / blocking.

Spatial locality is king so version #1 is faster. A good compiler can even vectorize the reads using SSE/AVX.
The CPU rearranges reads so it doesn't matter which one is first. In out-of-order CPUs it should matter very little if both cache lines are on the same way.
For large matrices, it is even more important to keep locality so the L1 cache remains hot (fewer cache misses).

Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried / is it possible to find out the answer yourself by some form of benchmarking?
You can get a high resolution timer by some function such as QueryPerformanceCounter (assuming you're on Windows) or OS equivalent, then iterate the reads you want to test by x amount of times, then get the high resolution timer again to get the average time a read took.
Perform this process again for different reads and you should be able to compare average read times for different types of read, which should answer your question. That's not to say that the answer will remain the same on different processors though.
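The benchmarking approach described above could be sketched roughly as follows. This is a minimal sketch, not a rigorous benchmark: it uses the standard C clock() function instead of QueryPerformanceCounter, and the helper name seconds_per_call is made up for this example. The two summing functions correspond to Version 1 and Version 2 from the question.

```c
#include <time.h>

#define N 16

/* Version 1 from the question: walk each row in order. */
float sum_row_major(float M[N][N])
{
    float x = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x += M[i][j];
    return x;
}

/* Version 2 from the question: walk each column in order. */
float sum_col_major(float M[N][N])
{
    float x = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x += M[j][i];
    return x;
}

/* Hypothetical timing helper: average CPU seconds per call over `reps` calls. */
double seconds_per_call(float (*f)(float [N][N]), float M[N][N], long reps)
{
    volatile float sink = 0.0f;   /* keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (long r = 0; r < reps; r++)
        sink += f(M);
    clock_t t1 = clock();
    (void)sink;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)reps;
}
```

Because clock() has coarse resolution, reps needs to be large (millions of calls for a matrix this small) before the averages mean anything, and a compiler may vectorize or unroll the two versions differently, so it is worth inspecting the generated assembly before drawing conclusions.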

Related

find nan in array of doubles using simd

This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN only value where a != a is true). Something like:
double *data = ...;
__m256d a, b;
int temp = 0;
//This bit would be in a loop over the array
//I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well if something like _CMP_EQ_UQ will detect NaNs, I can just use that and then I can compare 4 doubles to 4 doubles and magically look at 8 doubles at once.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?
You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. What does ordered / unordered comparison mean?
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x __m256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
    __m256d a = _mm256_loadu_pd(p+0);
    __m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
    __m256d c = _mm256_loadu_pd(p+8);
    __m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
    __m256d abcdnan = _mm256_or_pd(abnan, cdnan);
    return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
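To show how the two-vector unordered compare and the overlapping cleanup mentioned earlier fit together over a whole array, here is a minimal sketch. It assumes an x86-64 CPU with AVX and a GCC or Clang compiler (the target attribute is a compiler extension; compiling with -mavx instead also works). The function name any_nan_array is made up for this example, and it checks only 8 doubles per branch rather than ORing several compare results.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if p[0..n-1] contains a NaN, else 0. */
__attribute__((target("avx")))
int any_nan_array(const double *p, size_t n)
{
    if (n < 8) {
        /* Too short for one 8-double step: scalar fallback. */
        for (size_t i = 0; i < n; i++)
            if (p[i] != p[i])   /* NaN is the only value not equal to itself */
                return 1;
        return 0;
    }

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256d a = _mm256_loadu_pd(p + i);
        __m256d b = _mm256_loadu_pd(p + i + 4);
        /* A lane is all-ones if either operand in that lane is NaN. */
        __m256d unord = _mm256_cmp_pd(a, b, _CMP_UNORD_Q);
        if (_mm256_movemask_pd(unord))
            return 1;
    }

    if (i < n) {
        /* Cleanup: one step ending exactly at the end of the array.
           It re-checks some elements, which is harmless. */
        __m256d a = _mm256_loadu_pd(p + n - 8);
        __m256d b = _mm256_loadu_pd(p + n - 4);
        __m256d unord = _mm256_cmp_pd(a, b, _CMP_UNORD_Q);
        if (_mm256_movemask_pd(unord))
            return 1;
    }
    return 0;
}
```

Branching on every 8 doubles keeps the sketch short; for the expected no-NaN case, combining several compare results before each movemask, as in any_nan_block, should reduce the branch overhead.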
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fetestexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
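A minimal sketch of the sticky-flag idea, reading MXCSR directly through the macros in xmmintrin.h. It assumes x86-64, where scalar double math uses SSE and therefore updates MXCSR; the portable equivalent would be feclearexcept(FE_INVALID) and fetestexcept(FE_INVALID) from fenv.h. The function names and the divide_arrays producer are made up for this example, and a compiler is in principle free to move FP operations across the flag accesses unless it is told the FP environment is being observed.

```c
#include <xmmintrin.h>
#include <stddef.h>

/* Clear only the sticky "invalid operation" flag in MXCSR. */
void clear_invalid_flag(void)
{
    _MM_SET_EXCEPTION_STATE(_MM_GET_EXCEPTION_STATE() & ~_MM_EXCEPT_INVALID);
}

/* Nonzero if an "invalid" exception was raised since the flag was cleared. */
int invalid_flag_raised(void)
{
    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_INVALID) != 0;
}

/* Hypothetical producer: out[i] = num[i] / den[i].
   0.0 / 0.0 produces NaN and raises "invalid". */
void divide_arrays(const double *num, const double *den, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = num[i] / den[i];
}
```

The intended use is to call clear_invalid_flag() before the computation and invalid_flag_raised() after it; only when the flag is set does the array need to be scanned.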
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128 KiB block before running another loop on that same block while it's hot in cache. That's half the size of a 256 KiB L2, so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.

prefetching data at L1 and L2

In Agner Fog's manual Optimizing software in C++, section 9.10 "Cache contentions in large data structures", he describes a problem transposing a matrix when the matrix width is equal to something called the critical stride. In his test the cost for a matrix in L1 is 40% greater when the width is equal to the critical stride. If the matrix is even larger and only fits in L2 the cost is 600%! This is summed up nicely in Table 9.1 in his text. This is essentially the same thing observed at
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Later he writes:
The reason why this effect is so much stronger for level-2 cache contentions than for level-1 cache contentions is that the level-2 cache cannot prefetch more than one line at a time.
So my questions are related to prefetching data.
From his comment I infer that L1 can prefetch more than one cache line at a time. How many can it prefetch?
From what I understand trying to write code to prefetch the data (e.g. with _mm_prefetch) is rarely ever helpful. The only example I have read of is Prefetching Examples? and it's only a O(10%) improvement (on some machines). Agner later explains this:
The reason is that modern processors prefetch data automatically thanks to out-of-order execution and advanced prediction mechanisms. Modern microprocessors are able to automatically prefetch data for regular access patterns containing multiple streams with different strides. Therefore, you don't have to prefetch data explicitly if data access can be arranged in regular patterns with fixed strides.
So how does the CPU decide which data to prefetch and are there ways to help the CPU make better choices for the prefetching (e.g. "regular patterns with fixed strides")?
Edit: Based on a comment by Leeor let me add to my questions and make it more interesting. Why does the critical stride have so much more of an effect on L2 compared to L1?
Edit: I tried to reproduce Agner Fog's table using the code at Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
I ran this with MSVC2013 64-bit release mode on a Xeon E5 1620 (Ivy Bridge) which has L1 32KB 8-way, L2 256 KB 8-way, and L3 10MB 20-way. The max matrix size for L1 is about 90x90, 256x256 for L2, and 1619x1619 for L3.
Matrix Size    Average Time
64x64          0.004251 0.004472 0.004412 (three times)
65x65          0.004422 0.004442 0.004632 (three times)
128x128        0.0409
129x129        0.0169
256x256        0.219    //max L2 matrix size
257x257        0.0692
512x512        2.701
513x513        0.649
1024x1024      12.8
1025x1025      10.1
I'm not seeing any performance loss in L1; however, L2 clearly has the critical stride problem, and maybe L3 does too. I'm not sure yet why L1 does not show a problem. It's possible there is some other source of background overhead which is dominating the L1 times.
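The maximum matrix sizes quoted above follow from simple arithmetic: the largest n for which n*n elements fit in the cache. A small sketch, assuming 4-byte elements (the element size is an assumption here, chosen because it reproduces the quoted numbers) and a made-up helper name:

```c
#include <stddef.h>

/* Largest n such that an n x n matrix of elem_bytes-sized elements
   fits in cache_bytes (integer square root by simple search). */
size_t max_square_dim(size_t cache_bytes, size_t elem_bytes)
{
    size_t elems = cache_bytes / elem_bytes;
    size_t n = 0;
    while ((n + 1) * (n + 1) <= elems)
        n++;
    return n;
}
```

With 4-byte elements this gives 90 for a 32 KB L1, 256 for a 256 KB L2, and 1619 for a 10 MB L3. It ignores everything else competing for the cache, so the practical limits are somewhat lower.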
This statement:
the level-2 cache cannot prefetch more than one line at a time.
is incorrect.
In fact, the L2 prefetchers are often stronger and more aggressive than L1 prefetchers. It depends on the actual machine you use, but Intel's L2 prefetcher, for example, can trigger 2 prefetches for each request, while the L1 is usually limited (there are several types of prefetches that can coexist in the L1, but they're likely to be competing for a more limited BW than the L2 has at its disposal, so there will probably be fewer prefetches coming out of the L1).
The optimization guide, in Section 2.3.5.4 (Data Prefetching) counts the following prefetcher types:
Two hardware prefetchers load data to the L1 DCache:
- Data cache unit (DCU) prefetcher: This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
- Instruction pointer (IP)-based stride prefetcher: This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.
Data Prefetch to the L2 and Last Level Cache -
- Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
- Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
And a bit further ahead:
... The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
Of the above, only the IP-based prefetcher can handle strides greater than one cache line (the streaming ones can deal with anything that uses consecutive cache lines, meaning up to a 64-byte stride, or actually up to 128 bytes if you don't mind some extra lines). To use that, make sure that loads/stores at a given address would perform strided accesses - that's usually the case already in loops going over arrays. Compiler loop-unrolling may split that into multiple different stride streams with larger strides - that would work even better (the lookahead would be larger), unless you exceed the number of outstanding tracked IPs - again, that depends on the exact implementation.
However, if your access pattern does consist of consecutive lines, the L2 streamer is much more efficient than the L1 since it runs ahead faster.
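As an illustration of a "regular pattern with a fixed stride", here is a minimal sketch of a column walk over a row-major matrix. The loop body contains a single load whose address advances by a constant cols * sizeof(float) bytes per iteration, which is the kind of pattern the IP-based stride prefetcher is described as tracking. Whether a given CPU actually prefetches it depends on the implementation; the function name is made up for this example.

```c
#include <stddef.h>

/* Sums column `col` of a row-major matrix with `cols` floats per row.
   Each iteration loads from an address cols * sizeof(float) bytes
   past the previous one. */
float sum_column(const float *m, size_t rows, size_t cols, size_t col)
{
    float x = 0.0f;
    for (size_t i = 0; i < rows; i++)
        x += m[i * cols + col];
    return x;
}
```

With cols = 256 the stride is 1024 bytes, inside the 2K-byte limit quoted above; with cols = 1024 it is 4096 bytes, which exceeds that limit.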

Performance Optimization for Matrix Rotation

I'm now stuck on a performance optimization lab in the book "Computer Systems: A Programmer's Perspective", described as follows:
In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
-------                   -------
|1|2|3|                   |3|6|9|
-------                   -------
|4|5|6|  after rotate is  |2|5|8|
-------                   -------
|7|8|9|                   |1|4|7|
-------                   -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with the idea of unrolling the inner loop. The results are:
Code Version      Speed Up
original          1x
unrolled by 2     1.33x
unrolled by 4     1.33x
unrolled by 8     1.55x
unrolled by 16    1.67x
unrolled by 32    1.61x
I also got a code snippet from pastebin.com that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1;
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while(--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while(--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while(--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat 32 rows as a data zone and perform the rotation on each zone separately. The speedup of this version is 1.85x, beating all the loop-unrolled versions.
Here are the questions:
In the inner-loop-unroll version, why does the improvement level off as the unrolling factor increases? In particular, going from 8 to 16 does not give the same gain as going from 4 to 8. Does the result have some relationship with the depth of the CPU pipeline? If so, could the diminishing gain reflect the pipeline length?
What is the probable reason for the speedup of the data-zone version? It does not seem essentially different from the original naive version.
EDIT:
My test environment is an Intel Centrino Duo architecture and the version of gcc is 4.4
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the centrino core2 duo has two processors, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (ie, when you run the task manager (if you're on windows, top if you're on linux), you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get a faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last point, think about how a McDonald's drive-through works (I think that that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second car enters while the first is still ordering, then (assuming each operation in the drive-through takes one 'cycle' or time unit) 2 full operations will be done by the time 4 cycles have elapsed. If each car did all of its operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles for ordering, paying and getting food, for a total of 6 cycles. So, operation time decreases due to pipelining.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess, can't know for sure without seeing your code exactly, as well as the machine code). If all the instructions can't fit into cache or into the registers, then there is some time necessary to prepare them all to run (ie, people have to get into their cars and get to the drive through in the first place). There will be some reduction in speed if they all get there all at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are fewer operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically), whereas when counting up you have to explicitly compare to a max value on each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.
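The access order of the pointer-walking "data-zone" code can be written with explicit indices, which may make the cache behaviour easier to see. This is a sketch that, as far as I can tell, visits elements in the same order as the pastebin version; the name strip_rotate and the pixel layout are assumptions, since the question does not show the pixel definition.

```c
#define RIDX(i,j,n) ((i)*(n)+(j))
#define STRIP 32   /* dim is assumed to be a multiple of STRIP */

/* Assumed pixel layout; the question does not show the definition. */
typedef struct { unsigned short red, green, blue; } pixel;

void strip_rotate(int dim, const pixel *src, pixel *dst)
{
    for (int ii = 0; ii < dim; ii += STRIP)        /* one strip of 32 source rows */
        for (int r = 0; r < dim; r++)              /* each destination row */
            for (int i = ii; i < ii + STRIP; i++)  /* 32 consecutive writes */
                dst[RIDX(r, i, dim)] = src[RIDX(i, dim - 1 - r, dim)];
}
```

For a fixed strip, the innermost loop writes 32 adjacent destination elements and reads one element from each of the 32 source rows. The next value of r reads the neighbouring column of those same 32 rows, so the source cache lines touched on one pass are reused on the following passes instead of being evicted in between.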

Design code to fit in CPU Cache?

When writing simulations my buddy says he likes to try to write the program small enough to fit into cache. Does this have any real meaning? I understand that cache is faster than RAM and the main memory. Is it possible to specify that you want the program to run from cache or at least load the variables into cache? We are writing simulations so any performance/optimization gain is a huge benefit.
If you know of any good links explaining CPU caching, then point me in that direction.
At least with a typical desktop CPU, you can't really specify much about cache usage directly. You can still try to write cache-friendly code though. On the code side, this often means unrolling loops (for just one obvious example) is rarely useful -- it expands the code, and a modern CPU typically minimizes the overhead of looping. You can generally do more on the data side, to improve locality of reference and protect against cache conflicts (e.g. two frequently-used pieces of data that will try to use the same part of the cache, while other parts remain unused).
Edit (to make some points a bit more explicit):
A typical CPU has a number of different caches. A modern desktop processor will typically have at least 2 and often 3 levels of cache. By (at least nearly) universal agreement, "level 1" is the cache "closest" to the processing elements, and the numbers go up from there (level 2 is next, level 3 after that, etc.)
In most cases, (at least) the level 1 cache is split into two halves: an instruction cache and a data cache (the Intel 486 is nearly the sole exception of which I'm aware, with a single cache for both instructions and data--but it's so thoroughly obsolete it probably doesn't merit a lot of thought).
In most cases, a cache is organized as a set of "lines". The contents of a cache is normally read, written, and tracked one line at a time. In other words, if the CPU is going to use data from any part of a cache line, that entire cache line is read from the next lower level of storage. Caches that are closer to the CPU are generally smaller and have smaller cache lines.
This basic architecture leads to most of the characteristics of a cache that matter in writing code. As much as possible, you want to read something into cache once, do everything with it you're going to, then move on to something else.
This means that as you're processing data, it's typically better to read a relatively small amount of data (little enough to fit in the cache), do as much processing on that data as you can, then move on to the next chunk of data. Algorithms like Quicksort that quickly break large amounts of input into progressively smaller pieces do this more or less automatically, so they tend to be fairly cache-friendly, almost regardless of the precise details of the cache.
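As a rough sketch of chunked processing in C (the function name `scale_and_sum_chunked` and the `CHUNK` size are assumptions made for illustration), two passes over the data can be run chunk by chunk so the second pass finds its chunk still in cache:

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 1024  /* assumed: 1024 doubles = 8 KB, meant to fit in a typical L1 data cache */

/* Two passes over the data (scale, then sum), done one chunk at a time
 * so pass 2 re-reads data that pass 1 has just touched. */
double scale_and_sum_chunked(double *data, size_t n, double factor)
{
    assert(data != NULL || n == 0);
    double total = 0.0;
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = start + CHUNK < n ? start + CHUNK : n;
        for (size_t i = start; i < end; i++)   /* pass 1 on this chunk */
            data[i] *= factor;
        for (size_t i = start; i < end; i++)   /* pass 2 on the same chunk */
            total += data[i];
    }
    return total;
}
```

These two passes are simple enough to fuse into a single loop; the chunked form is meant for cases where the passes cannot be merged.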
This also has implications for how you write code. If you have a loop like:
for i = 0 to whatever
    step1(data);
    step2(data);
    step3(data);
end for
You're generally better off stringing as many of the steps together as you can up to the amount that will fit in the cache. The minute you overflow the cache, performance can/will drop drastically. If the code for step 3 above was large enough that it wouldn't fit into the cache, you'd generally be better off breaking the loop up into two pieces like this (if possible):
for i = 0 to whatever
    step1(data);
    step2(data);
end for
for i = 0 to whatever
    step3(data);
end for
Loop unrolling is a fairly hotly contested subject. On one hand, it can lead to code that's much more CPU-friendly, reducing the overhead of instructions executed for the loop itself. At the same time, it can (and generally does) increase code size, so it's relatively cache unfriendly. My own experience is that in synthetic benchmarks, which tend to do really small amounts of processing on really large amounts of data, you gain a lot from loop unrolling. In more practical code where you tend to have more processing on an individual piece of data, you gain a lot less--and overflowing the cache, leading to a serious performance loss, isn't particularly rare at all.
The data cache is also limited in size. This means that you generally want your data packed as densely as possible so as much data as possible will fit in the cache. Just for one obvious example, a data structure that's linked together with pointers needs to gain quite a bit in terms of computational complexity to make up for the amount of data cache space used by those pointers. If you're going to use a linked data structure, you generally want to at least ensure you're linking together relatively large pieces of data.
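To make the pointer overhead concrete, here is a small C sketch comparing a one-value-per-node list with a node that holds several values. The struct names, `VALUES_PER_NODE` and `chunk_list_sum` are hypothetical, and the byte counts in the comments assume a typical 64-bit target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One value per node: on a typical 64-bit target the pointer (plus
 * padding) takes more space than the 4-byte payload. */
struct plain_node {
    int32_t value;
    struct plain_node *next;
};

#define VALUES_PER_NODE 14  /* assumed: 14 * 4 + 8 = 64 bytes with 8-byte pointers */

/* Several values per node: one pointer is amortized over many values. */
struct chunk_node {
    int32_t values[VALUES_PER_NODE];
    struct chunk_node *next;
};

/* Sum every value in a list of chunk nodes. All nodes are full except
 * the last, which holds `count_in_last` valid entries. */
int64_t chunk_list_sum(const struct chunk_node *head, size_t count_in_last)
{
    assert(count_in_last <= VALUES_PER_NODE);
    int64_t total = 0;
    for (const struct chunk_node *n = head; n != NULL; n = n->next) {
        size_t used = n->next ? VALUES_PER_NODE : count_in_last;
        for (size_t i = 0; i < used; i++)
            total += n->values[i];
    }
    return total;
}
```

With these assumptions the plain node spends most of its bytes on linkage, while the chunked node spends 8 of 64.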
In a lot of cases, however, I've found that tricks I originally learned for fitting data into minuscule amounts of memory in tiny processors that have been (mostly) obsolete for decades work out pretty well on modern processors. The intent is now to fit more data in the cache instead of the main memory, but the effect is nearly the same. In quite a few cases, you can think of CPU instructions as nearly free, and the overall speed of execution is governed by the bandwidth to the cache (or the main memory), so extra processing to unpack data from a dense format works out in your favor. This is particularly true when you're dealing with enough data that it won't all fit in the cache at all any more, so the overall speed is governed by the bandwidth to main memory. In this case, you can execute a lot of instructions to save a few memory reads, and still come out ahead.
Parallel processing can exacerbate that problem. In many cases, rewriting code to allow parallel processing can lead to virtually no gain in performance, or sometimes even a performance loss. If the overall speed is governed by the bandwidth from the CPU to memory, having more cores competing for that bandwidth is unlikely to do any good (and may do substantial harm). In such a case, use of multiple cores to improve speed often comes down to doing even more to pack the data more tightly, and taking advantage of even more processing power to unpack the data, so the real speed gain is from reducing the bandwidth consumed, and the extra cores just keep from losing time to unpacking the data from the denser format.
Another cache-based problem that can arise in parallel coding is sharing (and false sharing) of variables. If two (or more) cores need to write to the same location in memory, the cache line holding that data can end up being shuttled back and forth between the cores to give each core access to the shared data. The result is often code that runs slower in parallel than it did in serial (i.e., on a single core). There's a variation of this called "false sharing", in which the code on the different cores is writing to separate data, but the data for the different cores ends up in the same cache line. Since the cache controls data purely in terms of entire lines of data, the data gets shuffled back and forth between the cores anyway, leading to exactly the same problem.
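One common way to avoid false sharing is to give each core's data its own cache line. The sketch below uses C11 `alignas`; the 64-byte `CACHE_LINE` value and the struct and function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; check your CPU's documentation */

/* Both counters will normally sit in one cache line: writes from two
 * cores to a and b would bounce that line back and forth. */
struct counters_shared {
    long a;
    long b;
};

/* Each counter is aligned to its own cache line, so a core writing `a`
 * does not invalidate the line holding `b`. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE,
              "each counter should occupy a full line");

/* Nonzero if two byte offsets, taken inside an object that starts on a
 * cache-line boundary, fall in the same cache line. */
int same_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE == off2 / CACHE_LINE;
}
```

The padding trades memory for isolation, so it is usually worth applying only to data that different cores write frequently.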
Here's a link to a really good paper on caches/memory optimization by Christer Ericson (of God of War I/II/III fame). It's a couple of years old but it's still very relevant.
A useful paper that will tell you more than you ever wanted to know about caches is What Every Programmer Should Know About Memory by Ulrich Drepper. Hennessy covers it very thoroughly. Christer and Mike Acton have written a bunch of good stuff about this too.
I think you should worry more about data cache than instruction cache — in my experience, dcache misses are more frequent, more painful, and more usefully fixed.
UPDATE: 1/13/2014
According to this senior chip designer, cache misses are now THE overwhelmingly dominant factor in code performance, so we're basically all the way back to the mid-80s and fast 286 chips in terms of the relative performance bottlenecks of load, store, integer arithmetic, and cache misses.
A Crash Course In Modern Hardware by Cliff Click @ Azul
--- we now return you to your regularly scheduled program ---
Sometimes an example is better than a description of how to do something. In that spirit, here's a particularly successful example of how I changed some code to better use on-chip caches. This was done some time ago on a 486 CPU and later migrated to a 1st generation Pentium CPU. The effect on performance was similar.
Example: Subscript Mapping
Here's an example of a technique I used to fit data into the chip's cache that has general purpose utility.
I had a double float vector that was 1,250 elements long, which was an epidemiology curve with very long tails. The "interesting" part of the curve only had about 200 unique values, but I didn't want a 2-sided if() test to make a mess of the CPU's pipeline (thus the long tails, which could accept as subscripts the most extreme values the Monte Carlo code would spit out), and I needed the branch prediction logic for a dozen other conditional tests inside the "hot-spot" in the code.
I settled on a scheme where I used a vector of 8-bit ints as a subscript into the double vector, which I shortened to 256 elements. The tiny ints in each tail all had the same value: everything more than 128 positions before zero (the center of the curve) pointed to the first value in the double vector, and everything more than 128 positions after zero pointed to the last, so only the middle 256 positions mapped to distinct values.
This shrank the storage requirement to 2k for the doubles and 1,250 bytes for the 8-bit subscripts, taking the total from 10,000 bytes down to 3,298. Since the program spent 90% or more of its time in this inner loop, the 2 vectors never got pushed out of the 8k data cache. The program immediately doubled its performance. This code got hit ~100 billion times in the process of computing an OAS value for 1+ million mortgage loans.
Since the tails of the curve were seldom touched, it's very possible that only the middle 200-300 elements of the tiny int vector were actually kept in cache, along with 160-240 middle doubles representing 1/8ths of percents of interest. It was a remarkable increase in performance, accomplished in an afternoon, on a program that I'd spent over a year optimizing.
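The subscript-mapping scheme can be sketched in C like this. The array sizes come from the description above; the names `index_map`, `table`, `build_index_map`, `curve_at` and the exact position of the middle region (`MIDDLE_START`) are assumptions, since the original code is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CURVE_LEN    1250  /* length of the original curve, from the text */
#define TABLE_LEN     256  /* distinct values kept, from the text */
#define MIDDLE_START  497  /* assumed: centers the 256 distinct values in the curve */

uint8_t index_map[CURVE_LEN];  /* 1,250 bytes of 8-bit subscripts */
double  table[TABLE_LEN];      /* 2,048 bytes of curve values */

/* Positions left of the middle region share table[0], positions right of
 * it share table[TABLE_LEN - 1], and the middle positions map one-to-one. */
void build_index_map(void)
{
    for (size_t i = 0; i < CURVE_LEN; i++) {
        if (i < MIDDLE_START)
            index_map[i] = 0;
        else if (i >= MIDDLE_START + TABLE_LEN)
            index_map[i] = TABLE_LEN - 1;
        else
            index_map[i] = (uint8_t)(i - MIDDLE_START);
    }
}

/* Hot-loop lookup: two small arrays and no range test on the value
 * (the assert is a debug-only bounds check on the subscript). */
double curve_at(size_t i)
{
    assert(i < CURVE_LEN);
    return table[index_map[i]];
}
```

The range test is paid once when the map is built, so the inner loop only does two dependent loads from arrays that together take 3,298 bytes.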
I agree with Jerry, as it has been my experience also, that tilting the code towards the instruction cache is not nearly as successful as optimizing for the data cache(s). This is one reason I think AMD's common caches are not as helpful as Intel's separate data and instruction caches. That is, you don't want instructions hogging up the cache, as it just isn't very helpful. In part this is because CISC instruction sets were originally created to make up for the vast difference between CPU and memory speeds, and except for an aberration in the late 80's, that's pretty much always been true.
Another favorite technique I use to favor the data cache, and savage the instruction cache, is by using a lot of bit-ints in structure definitions, and the smallest possible data sizes in general. To mask off a 4-bit int to hold the month of the year, or 9 bits to hold the day of the year, etc, etc, requires the CPU use masks to mask off the host integers the bits are using, which shrinks the data, effectively increases cache and bus sizes, but requires more instructions. While this technique produces code that doesn't perform as well on synthetic benchmarks, on busy systems where users and processes are competing for resources, it works wonderfully.
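A minimal C sketch of the bit-field idea, using the month and day-of-year widths mentioned above. The struct and function names are hypothetical, and bit-field layout is implementation-defined, so the sizes in the comments describe typical compilers only.

```c
#include <assert.h>

/* Plain layout: two full ints, typically 8 bytes. */
struct date_plain {
    unsigned int month;        /* 1..12  */
    unsigned int day_of_year;  /* 1..366 */
};

/* Packed layout: 4 + 9 = 13 bits, which typically fits in one 4-byte unit.
 * The compiler emits the shift-and-mask instructions for each access. */
struct date_packed {
    unsigned int month       : 4;
    unsigned int day_of_year : 9;
};

/* Build a packed date, checking that the values fit the field widths. */
struct date_packed make_date(unsigned int month, unsigned int day_of_year)
{
    assert(month >= 1 && month <= 12);
    assert(day_of_year >= 1 && day_of_year <= 366);
    struct date_packed d = { month, day_of_year };
    return d;
}
```

Every read or write of a field costs a few extra instructions, which is the trade described above: more work per access in exchange for more records per cache line.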
Mostly this will serve as a placeholder until I get time to do this topic justice, but I wanted to share what I consider to be a truly groundbreaking milestone - the introduction of dedicated bit manipulation instructions in the new Intel Haswell microprocessor.
It became painfully obvious when I wrote some code here on StackOverflow to reverse the bits in a 4096 bit array that 30+ yrs after the introduction of the PC, microprocessors just don't devote much attention or resources to bits, and that I hope will change. In particular, I'd love to see, for starters, the bool type become an actual bit datatype in C/C++, instead of the ridiculously wasteful byte it currently is.
UPDATE: 12/29/2013
I recently had occasion to optimize a ring buffer which keeps track of 512 different resource users' demands on a system at millisecond granularity. There is a timer which fires every millisecond which added the sum of the most current slice's resource requests and subtracted out the 1,000th time slice's requests, comprising resource requests now 1,000 milliseconds old.
The Head, Tail vectors were right next to each other in memory, except when first the Head, and then the Tail wrapped and started back at the beginning of the array. The (rolling)Summary slice however was in a fixed, statically allocated array that wasn't particularly close to either of those, and wasn't even allocated from the heap.
Thinking about this, and studying the code a few particulars caught my attention.
The demands that were coming in were added to the Head and the Summary slice at the same time, right next to each other in adjacent lines of code.
When the timer fired, the Tail was subtracted out of the Summary slice, and the results were left in the Summary slice, as you'd expect
The 2nd function called when the timer fired advanced all the pointers servicing the ring. In particular....
The Head overwrote the Tail, thereby occupying the same memory location
The new Tail occupied the next 512 memory locations, or wrapped
The user wanted more flexibility in the number of demands being managed, from 512 to 4098, or perhaps more. I felt the most robust, idiot-proof way to do this was to allocate both the 1,000 time slices and the summary slice all together as one contiguous block of memory so that it would be IMPOSSIBLE for the Summary slice to end up being a different length than the other 1,000 time slices.
Given the above, I began to wonder if I could get more performance if, instead of having the Summary slice remain in one location, I had it "roam" between the Head and the Tail, so it was always right next to the Head for adding new demands, and right next to the Tail when the timer fired and the Tail's values had to be subtracted from the Summary.
I did exactly this, but then found a couple of additional optimizations in the process. I changed the code that calculated the rolling Summary so that it left the results in the Tail, instead of the Summary slice. Why? Because the very next function was performing a memcpy() to move the Summary slice into the memory just occupied by the Tail. (weird but true, the Tail leads the Head until the end of the ring when it wraps). By leaving the results of the summation in the Tail, I didn't have to perform the memcpy(), I just had to assign pTail to pSummary.
In a similar way, the new Head occupied the now stale Summary slice's old memory location, so again, I just assigned pSummary to pHead, and zeroed all its values with a memset to zero.
Leading the way to the end of the ring (really a drum, 512 tracks wide) was the Tail, but I only had to compare its pointer against a constant pEndOfRing pointer to detect that condition. All of the other pointers could be assigned the pointer value of the vector just ahead of it. That is, I only needed a conditional test for 1 of the 3 pointers to correctly wrap them.
The initial design had used byte ints to maximize cache usage. However, I was able to relax this constraint (satisfying the user's request to handle higher resource counts per user per millisecond) and use unsigned shorts while STILL doubling performance, because even with 3 adjacent vectors of 512 unsigned shorts, the 32K L1 data cache could easily hold the required 3,072 bytes, 2/3rds of which were in locations just used. Only when the Tail, Summary, or Head wrapped was 1 of the 3 separated by any significant "step" in the 8MB L3 cache.
The total run-time memory footprint for this code is under 2MB, so it runs entirely out of on-chip caches, and even on an i7 chip with 4 cores, 4 instances of this process can be run without any degradation in performance at all, and total throughput goes up slightly with 5 processes running. It's an Opus Magnum on cache usage.
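For readers who want something concrete, here is a simplified C sketch of a ring of time slices plus a rolling summary, all carved from one contiguous allocation as described above. It deliberately does not reproduce the roaming-Summary pointer rotation, and every name in it (`ring_t`, `ring_init`, `ring_add`, `ring_tick`, `ring_free`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: `slices` time slices followed by one summary slice,
 * all `users` wide, so the summary can never differ in length. */
typedef struct {
    uint32_t *block;    /* (slices + 1) * users counters, one allocation */
    uint32_t *summary;  /* last slice of the block */
    size_t users, slices;
    size_t head;        /* slice currently receiving demands */
} ring_t;

int ring_init(ring_t *r, size_t users, size_t slices)
{
    assert(users > 0 && slices > 0);
    r->block = calloc((slices + 1) * users, sizeof *r->block);
    if (r->block == NULL)
        return -1;
    r->summary = r->block + slices * users;
    r->users = users;
    r->slices = slices;
    r->head = 0;
    return 0;
}

/* Record a demand in the current slice and in the rolling summary. */
void ring_add(ring_t *r, size_t user, uint32_t amount)
{
    assert(user < r->users);
    r->block[r->head * r->users + user] += amount;
    r->summary[user] += amount;
}

/* Timer tick: the next slice holds the oldest data; subtract it from the
 * summary, clear it, and make it the new head. */
void ring_tick(ring_t *r)
{
    size_t next = (r->head + 1) % r->slices;
    uint32_t *oldest = r->block + next * r->users;
    for (size_t u = 0; u < r->users; u++) {
        r->summary[u] -= oldest[u];
        oldest[u] = 0;
    }
    r->head = next;
}

void ring_free(ring_t *r)
{
    free(r->block);
    r->block = NULL;
}
```

In this simplified version the summary stays at the end of the block; the optimization described above goes further by rotating pointers so the summary is always adjacent to the slice being touched.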
Most C/C++ compilers prefer to optimize for size rather than for "speed". That is, smaller code generally executes faster than unrolled code because of cache effects.
If I were you, I would make sure I know which parts of code are hotspots, which I define as
a tight loop not containing any function calls, because if it calls any function, then the PC will be spending most of its time in that function,
that accounts for a significant fraction of execution time (like >= 10%) which you can determine from a profiler. (I just sample the stack manually.)
If you have such a hotspot, then it should fit in the cache. I'm not sure how you tell it to do that, but I suspect it's automatic.

How cache memory works?

Today in my computer organization class, the teacher talked about something that interested me. When explaining why cache memory works, he said that in this code:
for (i=0; i<M; i++)
    for(j=0; j<N; j++)
        X[i][j] = X[i][j] + K; //X is double(8 bytes)
it is not good to swap the first line with the second. What are your opinions on this? And why is it like that?
There is a very good paper by Ulrich Drepper of Red Hat and glibc fame, What Every Programmer Should Know About Memory. One section discussed caches in great detail. For example, there are cache effects in SMP systems where CPUs can end up thrashing ownership of a modified cache line back and forth, greatly harming performance.
Locality of reference. Because the data is stored by rows, for each row the j columns are in adjacent memory addresses. The hardware loads memory into the cache a whole line at a time (and the OS pages memory in a whole page at a time), so adjacent address references will likely refer to that same line or page. If you increment the row index in the inner loop, consecutive accesses are separated by N doubles each and may land on different lines or pages, so the cache may have to constantly bring in and throw away data as it references it. This is called thrashing and is bad for performance.
In practice and with larger, modern caches, the sizes of the rows/columns would need to be reasonably large before this would come into play, but it's still good practice.
[EDIT] The answer above is specific to C and may differ for other languages. The only one that I know is different is FORTRAN. FORTRAN stores things in column major order (the above is row major) and it would be correct to change the order of the statements in FORTRAN. If you want/need efficiency, it's important to know how your language implements data storage.
It is like that because caches like locality. The same number of memory accesses, but spaced further apart, will hit different "lines" of cache, or might even miss the cache altogether. It is therefore good, whenever you have the choice, to organize data so that accesses that are likely to happen close to each other in time also do so in space. This increases the chance of a cache hit, and gives you more performance.
There is of course a wealth of information about this topic available, see for instance this Wikipedia entry on locality of reference. Or, I guess, your own course text book. :)
In C, n-dimensional matrices are row major, meaning the last index into the matrix represents adjacent spaces in memory. This is different than some other languages, FORTRAN for example, which are column major. In FORTRAN, it's more efficient to iterate through a 2D matrix like this:
do jj = 1,N
    do ii = 1,M
        x(ii,jj) = x(ii,jj) + K;
    enddo
enddo
Cache memory is very fast and very expensive memory that sits close to the CPU. Rather than fetch one small piece of data from the RAM each time, the CPU fetches a chunk of data and stores it in the cache. The bet is that if you just read one byte, then the next byte you read is likely to be right after it. If this is the case, then it can come from the cache.
By laying out your loop as you have it, you read the bytes in the order that they are stored in memory. This means that they are in the cache, and can be read very quickly by the CPU. If you swapped around lines 1 and 2, then you'd jump ahead by N elements (N * 8 bytes for doubles) each time around the inner loop. The bytes you are reading are no longer consecutive in memory, and so they may not be in the cache. The CPU has to fetch them from the (slower) RAM, and so your performance decreases.
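The difference in stride can be checked directly in C. In this sketch the dimensions `ROWS` and `COLS` and the function names are assumptions chosen for illustration; `X` plays the role of the matrix from the question.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4
#define COLS 1000

double X[ROWS][COLS];

static_assert(sizeof X == ROWS * COLS * sizeof(double),
              "X is one contiguous row-major block");

/* Byte distance between two elements of X. */
ptrdiff_t byte_distance(const double *from, const double *to)
{
    return (const char *)to - (const char *)from;
}

/* Original loop order: the inner loop walks consecutive addresses. */
void add_k_row_order(double k)
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            X[i][j] += k;
}

/* Swapped loops: same result, but each inner step jumps COLS doubles. */
void add_k_column_order(double k)
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            X[i][j] += k;
}
```

Neighbours along a row are `sizeof(double)` bytes apart, while neighbours along a column are `COLS * sizeof(double)` bytes apart, which is why the swapped version touches a new cache line on almost every access once the rows are long enough.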

Resources