Hit / miss rate counting for array caching - C

I'm reading the Computer Systems book by Bryant & O'Hallaron, and there is an exercise whose given solution seems to be incorrect, so I'd like to make sure.
Given:
struct point {
    int x;
    int y;
};

struct point array[32][32];

for (i = 31; i >= 0; i--) {
    for (j = 31; j >= 0; j--) {
        sum_x += array[j][i].x;
        sum_y += array[j][i].y;
    }
}
Assume sizeof(int) == 4, and that we have a 4096-byte cache with a block (line) size of 32 bytes.
The question asks for the hit rate.
My reasoning was: we have 4096/32 = 128 blocks, and each block can store 4 points (4 * 8 bytes = 32 bytes), so the cache can hold half of the array, i.e. 512 of the 1024 points (32 * 32). Since the code accesses the array in column-major order, the access to each point's .x is always a miss, while the access to .y is a hit. Therefore miss rate = hit rate = 1/2.
Problem: The solution says the hit rate is 3/4 because the cache can store the whole array.
But according to my reasoning the cache can hold only half of the points.
Did I miss something?

The array's top four rows occupy a part of the cache:
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|...
Above is a schematic of the array as an applied mathematician would write the array on paper. Each element consists of an (x,y) pair, a point.
The four rows labeled o in the diagram comprise 128 points, enough to fill 1024 bytes, which is only one quarter of the cache. But notice that, in your code, the variable i is both the major loop counter and the array's row index (as written on paper).
So, let's look at the diagram again. How do your nested loops step through the array as diagrammed?
Answer: apparently, your loops step rightward across the top row as diagrammed, with j (column) as the minor loop counter. However, as you have observed, the array is stored by columns. Therefore, when element [j][i] == [0][0] is loaded, an entire cache line is loaded with it. And what comprises that cache line? It's the four elements marked * in the diagram.
Therefore, while your inner loop iterates across the array's top row as diagrammed, the cache misses every time, fetching four elements each time. And then for the next three rows, it's all hits.
This isn't easy to think about. It's a fine problem, and I wouldn't expect you to grasp my answer instantly, but if you carefully consider the sequence of loads as I have explained, it should (after a bit of pondering) begin to make sense.
With the given loop nesting, the hit rate is indeed 3/4.
FURTHER DISCUSSION
In comments, you have asked a good follow-up question:
Can you write an element (e.g. array[3][14].x) that would hit?
I can. The array[j][i] == array[10][5] would hit. (Both .x and .y would hit.)
I will explain. The array[j][i] == array[10][4] would miss, whereas array[10][5], array[10][6] and array[10][7] would eventually hit. Why eventually? This is significant. Although all four of the elements I have named are loaded by cache hardware at once, array[10][5] is not accessed by your code (that is, by the CPU) when array[10][4] is accessed. Rather, after array[10][4] is accessed, array[11][4] is next accessed by the program and CPU.
The program and CPU only get around to accessing array[10][5] rather later.
And, indeed, if you think about it, this makes sense, doesn't it, because that is part of what caches do: they load additional data now, quietly as part of a cache line, so that the CPU can quickly access the additional data later if it needs it.
APPENDIX: FORTRAN/BLAS/LAPACK MATRIX ORDERING
It is standard in numerical computing to store matrices by column rather than by row. This is called column-major storage. Unfortunately, unlike the earlier Fortran programming language, the C programming language was not originally designed for numerical computing, so, in C, to store arrays by column, one must write array[column][row] == array[j][i]—which notation of course reverses the way an applied mathematician with his or her pencil would write it.
This is an artifact of the C programming language. The artifact has no mathematical significance but, when programming in C, you must remember to type [j][i]. [Were you programming in the now mostly obsolete Fortran programming language, you would type (i, j), but this isn't Fortran.]
The reason column-major storage is standard has to do with the sequence in which the CPU performs scalar, floating-point multiplications and additions when, in mathematical/pencil terminology, a matrix [A] left-operates on a column vector x. The standard Basic Linear Algebra Subroutines (BLAS) library, used by LAPACK and others, works this way. You and I should work this way, too, not only because we are likely to need to interface with BLAS and/or LAPACK but because, numerically, it's smoother.
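As a small illustration (the macro name and parameter names below are my own, not from BLAS itself), here is a column-major indexing helper in C that mirrors how BLAS/LAPACK address element (i, j) of a matrix stored with leading dimension ld:
#include <stddef.h>

/* Column-major indexing as used by BLAS/LAPACK: element (i, j) of a matrix
 * stored column by column, with leading dimension ld (the allocated number
 * of rows per column). The name IDX is illustrative. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* Example: summing column j of an m-row matrix touches contiguous memory. */
double column_sum(const double *A, int m, int ld, int j) {
    double s = 0.0;
    for (int i = 0; i < m; i++)
        s += A[IDX(i, j, ld)];
    return s;
}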

If you've transcribed the program correctly then you're correct, the 3/4 answer is wrong.
The 3/4 answer would be correct if the indexes in the innermost sum += ... statements were arranged so that the rightmost index varied the most quickly, i.e. as:
sum_x += array[i][j].x;
sum_y += array[i][j].y;
In that case the 1st, 5th, 9th ... iterations of the loop would miss, but the line loaded into the cache by each of those misses would cause the next three iterations to hit.
However, with the program as written, every iteration misses. Each cache line that is loaded from memory supplies data for only a single point, and then that line is always replaced before the data for any of the other three points in the line is accessed.
As an example (assuming for simplicity that the address of the first member array[0][0] is aligned with the start of the cache), the reference to array[31][31] in the first pass through the loop is a miss that causes line 127 of the cache to be loaded. Line 127 now contains the data for [31][28], [31][29], [31][30] and [31][31]. However, the fetch of array[15][31] causes line 127 to be overwritten before array[31][30] is referenced, so when [31][30]'s turn eventually arrives it is a miss too. And then a miss at [15][30] replaces the line before [31][29] is referenced.
IMO your 1/2 hit ratio is over-generous because it counts the access to the .y coordinate as a hit, which is not how the original 3/4 answer counts. If the fetch of the .y coordinate were counted as a separate hit, the original answer would have been 7/8. Instead it counts each complete point, or perhaps each loop iteration, as a hit or a miss. By that measure the hit rate for the program as written in your question is a nice round 0.
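If you want to check the counting mechanically, here is a minimal sketch of a simulator for this access pattern. It assumes a direct-mapped 4096-byte cache with 32-byte lines and counts each .x and .y access separately; under those assumptions every .x access misses and every .y access hits.
#include <stdio.h>
#include <string.h>

/* Sketch: direct-mapped 4096-byte cache, 32-byte lines, row-major
 * struct point array[32][32] with 8 bytes per point. */
int main(void) {
    enum { LINES = 4096 / 32 };
    long tags[LINES];
    memset(tags, -1, sizeof tags);   /* all lines start empty */

    long hits = 0, misses = 0;
    for (int i = 31; i >= 0; i--) {
        for (int j = 31; j >= 0; j--) {
            long base = ((long)j * 32 + i) * 8;      /* byte offset of array[j][i] */
            for (int off = 0; off < 8; off += 4) {   /* .x then .y */
                long block = (base + off) / 32;
                long set = block % LINES;
                if (tags[set] == block) hits++;
                else { misses++; tags[set] = block; }
            }
        }
    }
    printf("hits=%ld misses=%ld hit rate=%.2f\n",
           hits, misses, (double)hits / (hits + misses));
    return 0;
}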

Related

elegant (and fast!) way to rearrange columns and rows in an ADC buffer

Abstract:
I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.
Introduction:
On an ARM Cortex-M4 processor I am using 3 ADCs to sample analog values, with DMA and the "double buffer technique". When I get a "half buffer complete" interrupt, the data in the 1D array are arranged like this:
Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn
Where Sn stands for the sample number and Chn for the channel number.
As I do 2x oversampling, n equals 16; the channel count is 9 in reality, but in the example above it is 3.
Or written in an 2D-form
Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn
Where the rows represent the n samples and the columns represent the channels ...
I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, multiplication, once I have "sorted out" the channels. This part is pretty fast.
Issue:
But the code I am using for "reshaping" the 1-D Buffer array to an accumulated value for each channel is pretty poor and slow:
int bP = 0;   // running index into ADC_DMABuffer; must start at zero
for (i = 0; i < ADC_BUFFER_SZ; i++) {
    for (j = 0; j < MEAS_ADC_CHANNELS; j++) {
        if (i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
        else   *(ADC_acc + j)  = *(ADC_DMABuffer + bP); // initialize on the first run
        bP++;
    }
}
After this procedure I get a 1D array with one (accumulated) U32 value per channel, but this code is pretty slow: ~4000 clock cycles for 16 samples per channel / 9 channels, or ~27 clock cycles per sample. In order to achieve higher sample rates, this needs to be many times faster than it is right now.
Question(s):
What I am looking for is some elegant way, using the CMSIS-DSP functions, to achieve the same result as above, but much faster. My gut says that I am thinking in the wrong direction, that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic and I most probably won't be the last. So I'm asking for a little push in the right direction, as I guess this could be a severe case of "work-blindness" ...
I was thinking about using the dot-product function "arm_dot_prod_q31" together with an array filled with ones for the accumulation task, because I could not find the CMSIS function that would simply sum up a 1D array. But this would not solve the "reshaping" issue; I would still have to copy data around and create new buffers to prepare the vectors for the "arm_dot_prod_q31" call ...
Besides that, it feels somehow awkward to use a dot product where I just want to sum up array elements …
I also thought about transforming the ADC buffer into a 16 x 9 or 9 x 16 matrix, but then I could not find anything where I could easily (= fast & elegant) access rows or columns, which would leave me with another issue to solve, and which would eventually require creating new buffers and copying data around, as I am missing a function where I could multiply a matrix with a vector ...
Maybe someone has a hint for me, that points me in the right direction?
Thanks a lot and cheers!
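A minimal sketch (untested) of the per-frame accumulation idea using CMSIS-DSP's arm_add_q31(): since each sample frame stores its channels consecutively, the per-channel sums can be updated one vector add per frame. The buffer names and sizes below are illustrative, and it assumes the samples are stored as q31_t values.
#include <string.h>
#include "arm_math.h"

#define N_CH      9      /* channels per sample frame (MEAS_ADC_CHANNELS) */
#define N_SAMPLES 16     /* sample frames in the half buffer */

/* The DMA buffer is laid out frame by frame (Ch1Sk, Ch2Sk, ..., ChNSk),
 * so sums[c] += frame[c] can be done for all channels at once.
 * Note: arm_add_q31() saturates rather than wraps, which is harmless when
 * accumulating 16 ADC-range values; in-place use of `sums` as both source
 * and destination is assumed to be acceptable here. */
void accumulate_channels(const q31_t *dma_buf, q31_t *sums)
{
    memset(sums, 0, N_CH * sizeof(q31_t));
    for (uint32_t s = 0; s < N_SAMPLES; s++) {
        arm_add_q31(sums, &dma_buf[s * N_CH], sums, N_CH);
    }
}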
ARM is a RISC device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data as physically close in memory as your system will allow.
I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};   // zero the accumulators before summing

for (int idxRaw = 0, idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++)
{
    sums[idxSum++] += raw[idxRaw];
    if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;
}
Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough that you should be able to get it working quickly.
Writing pointer math in your code will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.
That said, I often use a pointer for iteration:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};   // zero the accumulators before summing
int *itRaw = raw;
int *itRawEnd = raw + ADC_BUFFER_SZ;
int *itSums = sums;
int *itSumsEnd = itSums + MEAS_ADC_CHANNELS;

while (itRaw != itRawEnd)
{
    *itSums += *itRaw;
    itRaw++;
    itSums++;
    if (itSums == itSumsEnd) itSums = sums;
}
But almost never when I am working with a mathematician or scientist, which is often the case with measurement/metrological device development. It's easier to explain array notation to non-C reviewers than the iterator form.
Also, if I have an algorithm description that uses the phrase "for each...", I tend to prefer the for loop form, but when the description uses "while ...", then of course I will probably use the while... form, unless I can skip one or more variable assignment statements by rearranging it to a do..while. But I often stick as close as possible to the original description until after I've passed all the testing criteria, then do rearrangement of loops for code hygiene purposes. It's easier to argue with a domain expert that their math is wrong, when you can easily convince them that you implemented what they described.
Always get it right first, then measure and make the determination whether to further hone the code. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a wary eye on the machine code they generated, and often developed habits that avoided those worst-case scenarios. That is uncommon today, and almost certainly not the case for your ARM tool chain. But you may have to look into how your compiler's optimization features work and try something different.
Do try to avoid doing value math on the same line as your pointer math. It's just confusing:
*(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
*(p1++) = *(p2++);                  // Reasonable, especially for experienced coders/reviewers.
p1[offset1] += p2[offset2];         // Okay. Doesn't mix value math with pointer notation.
p1[offset1 + A*B/C] += p2...;       // Very bad.
// But...
int offset1 = A*B/C;                // Especially helpful when stepping in the debugger.
p1[offset1]... ;                    // Much better.
Hence the iterator form mentioned earlier. It may reduce the number of lines of code, but it does not reduce the complexity, and it definitely increases the odds of introducing a bug at some point.
A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely, universal binding rules across languages. Intentions are obvious, even to non-programmers. While the examples above are pretty trivial and most C programmers would have no problem reading any of them, it's when the number of variables involved and the complexity of the math increase that mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it altogether.

Last used cache line versus different cache lines

Let's assume cache lines are 64 bytes wide and I have two arrays a and b which fill a cache line and are also aligned to a cache line. Let's also assume that both arrays are in the L1 cache so when I read from them I don't get a cache miss.
float a[16]; //64 byte aligned e.g. with __attribute__((aligned (64)))
float b[16]; //64 byte aligned
I read a[0]. My question is: is it now faster to read a[1] than to read b[0]? In other words, is it faster to read from the last-used cache line?
Does the set matter? Let's now assume that I have a 32 KB L1 data cache which is 4-way. So if a and b are 8192 bytes apart they end up in the same set. Will this change the answer to my question?
Another way to ask my question (which is what I really care about) is in regards to reading a matrix.
In other words which one of these two code options will be more efficient assuming matrix M fits in the L1 cache and is 64 byte aligned and is already in the L1 cache.
float M[16][16]; //64 byte aligned
Version 1:
for (int i = 0; i < 16; i++) {
    for (int j = 0; j < 16; j++) {
        x += M[i][j];
    }
}
Version 2:
for (int i = 0; i < 16; i++) {
    for (int j = 0; j < 16; j++) {
        x += M[j][i];
    }
}
Edit: To make this clear, due to SSE/AVX let's assume I read the first eight values from a at once with AVX (e.g. with _mm256_load_ps()). Will reading the next eight values from a be faster than reading the first eight values from b (recall that a and b are already in the cache so there will not be a cache miss)?
Edit: I'm mostly interested in all processors since Intel Core 2 and Nehalem, but I'm currently working with an Ivy Bridge processor and plan to use Haswell soon.
With current Intel processors, there is no performance difference between loading two different cache lines that are both in L1 cache, all else being equal. Given float a[16], b[16]; with a[0] recently loaded, a[1] in the same cache line as a[0], and b[0] not recently loaded but still in L1 cache, then there will be no performance difference between loading a[1] and b[0] in the absence of some other factor.
One thing that can cause a difference is if there has recently been a store to some address that shares some bits with one of the values being loaded, although the entire address is different. Intel processors compare some of the bits of addresses to determine whether they might match a store that is currently in progress. If the bits match, some Intel processors delay the load instruction to give the processor time to resolve the complete virtual address and compare it to the address being stored. However, this is an incidental effect that is not particular to a[1] or b[0].
It is also theoretically possible that a compiler that sees your code is loading both a[0] and a[1] in short succession might make some optimization, such as loading them both with one instruction. My comments above apply to hardware behavior, not C implementation behavior.
With the two-dimensional array scenario, there should still be no difference as long as the entire array M is in L1 cache. However, column traversals of arrays are notorious for performance problems when the array exceeds L1 cache. A problem occurs because addresses are mapped to sets in cache by fixed bits in the address, and each cache set can hold only a limited number of cache lines, such as four. Here is a problem scenario:
An array M has a row length that is a multiple of the distance that results in addresses being mapped to the same cache sets, such as 4096 bytes. E.g., in the array float M[1024][1024];, M[0][0] and M[1][0] are 4096 bytes apart and map to the same cache set.
As you traverse a column of the array, you access M[0][0], M[1][0], M[2][0], M[3][0], and so on. The cache line for each of these elements is loaded into cache.
As you continue along the column, you access M[8][0], M[9][0], and so on. Since each of these uses the same cache set as the previous ones and the cache set can hold only four lines, the earlier lines containing M[0][0] and so on are evicted from cache.
When you complete the column and start the next column by reading M[0][1], the data is no longer in L1 cache, and all of your loads must fetch the data from L2 cache (or worse if you also thrashed L2 cache in the same way).
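To make the set arithmetic concrete, here is a small sketch; the line size and set count are assumptions (64-byte lines and 64 sets, i.e. a 32 KB, 8-way L1d, which is the geometry of recent Intel parts). With that geometry, addresses 4096 bytes apart land in the same set.
#include <stdint.h>
#include <stdio.h>

/* Sketch of L1 set mapping under assumed parameters. */
#define LINE_BYTES 64
#define NUM_SETS   64

static unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

int main(void) {
    /* Row stride of float M[1024][1024] is 4096 bytes, so M[0][0],
     * M[1][0], M[2][0], ... all fall into the same set. */
    uintptr_t base = 0x100000;                /* any aligned base address */
    for (int row = 0; row < 4; row++)
        printf("M[%d][0] -> set %u\n", row, set_index(base + (uintptr_t)row * 4096));
    return 0;
}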
Fetching a[0] and then either a[1] or b[0] should amount to 2 cache accesses that hit the L1 in either case. You didn't say which uArch you're using, but I'm not familiar with any mechanism that does further "caching" of the full cache line above the L1 (anywhere in the memory unit), and I don't think such a mechanism could be feasible (at least not for any reasonable price).
Assume you read a[0] and then a[1], and would like to save the effort of accessing the L1 again for that line - your HW would have to not only keep the full cache line somewhere in the memory unit in case it's going to be accessed again (not sure how common that case is, so this feature is probably not worth the effort), but also keep it snoopable as a logical extension of your cache in case some other core tries to modify a[1] between these two reads (which x86 permits for WB memory). In fact, it could even be a store in the same thread context, and you'd have to guard against that (since most common x86 CPUs today perform loads out of order). If you don't maintain both of these (and probably other safeguards too) - you break coherency; if you do - you've created a monster logic that does the same as your L1 already does, just to save a meager 1-2 cycles of access.
However, even though both options would require the same number of cache accesses, there may be other considerations affecting their efficiency, such as L1 banking, same-set access restrictions, lazy LRU updating, etc. All of which depend on your exact machine implementation.
If you don't focus only on memory/cache access efficiency, your compiler should be able to vectorize accesses to consecutive memory locations, which would still incur the same accesses but will be lighter on execution BW. I think that any decent compiler should be able to unroll your loops at this size and combine the consecutive accesses into a single vector, but you may be able to help it by using option 1 (especially if there are also writes or other problematic instructions in the middle that would complicate the job for the compiler).
Edit
Since you're also asking about fitting the matrix in the L2 - that simplifies the question - in that case using the same line(s) multiple times as in option 1 is better, as it allows you to hit the L1, while the alternative is to constantly fetch from the L2, which costs you latency and bandwidth. This is the basic principle behind loop tiling / blocking.
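A minimal sketch of that blocking idea in C, in case it helps (BLOCK and the matrix shape are illustrative; it keeps the column-wise access of version 2 but limits the working set to a handful of lines):
#define N     1024
#define BLOCK 16   /* rows per block: 16 floats span one 64-byte line */

/* Loop-blocking sketch for a column-wise traversal of a row-major matrix:
 * process BLOCK rows at a time so that the lines they occupy are reused
 * across neighbouring columns before being evicted. */
float sum_columnwise_blocked(const float M[N][N]) {
    float x = 0.0f;
    for (int jb = 0; jb < N; jb += BLOCK)           /* block of rows */
        for (int i = 0; i < N; i++)                 /* all columns */
            for (int j = jb; j < jb + BLOCK; j++)   /* rows within the block */
                x += M[j][i];                       /* same access pattern as version 2 */
    return x;
}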
Spatial locality is king, so version #1 is faster. A good compiler can even vectorize the reads using SSE/AVX.
The CPU rearranges reads, so it doesn't matter which one is first. In out-of-order CPUs it should matter very little if both cache lines are in the same way.
For large matrices, it is even more important to keep locality so the L1 cache remains hot (fewer cache misses).
Although I don't know the answer to your question(s) directly (someone else may have more knowledge about processor architecture), have you tried / is it possible to find out the answer yourself by some form of benchmarking?
You can get a high resolution timer by some function such as QueryPerformanceCounter (assuming you're on Windows) or OS equivalent, then iterate the reads you want to test by x amount of times, then get the high resolution timer again to get the average time a read took.
Perform this process again for different reads and you should be able to compare average read times for different types of read, which should answer your question. That's not to say that the answer will remain the same on different processors though.

Array access taking O(1) time improvable?

I've been reading a book assigned for class and it mentions that array access takes O(1) time. I realize that this is very fast (maybe as fast as possible), but if you have a loop that has to refer to the same array element a few times, is there any advantage to assigning the looked-up value to a temporary variable? Or would the temporary variable still be O(1) to use as well?
I'm assuming this question is language independent. Also I realize that even if the answer is yes that the advantage is tiny, I'm just curious.
Note that O(1) doesn't mean "instantaneous." It just means "at most some constant." This means that 1 and 10^1000 are both O(1), even though the second of these is bigger than the number of atoms in the universe.
If you are repeatedly accessing the same array element multiple times, it will take O(1) time for each access. Storing that array element in a local variable also gives O(1) lookup time, but the constants might not be the same. It might be better to pick one option over the other, but you'd really have to profile the program to be sure.
In practice, this sort of microoptimization is unlikely to have a measurable effect on program time unless the code you're running accounts for a huge fraction of the program's running time. I would be shocked to find an example where this change would make a noticeable impact in any real code.
Modern architectures might make this change a bit faster, but not dramatically so. If you keep accessing the same array element multiple times, the processor will probably keep that part of the array in cache, making lookups really fast. Also, a good optimizing compiler might already turn the non-local-copy code into the local-copy code for you.
Hope this helps!
If I understand, you're asking if
for (int i = 0; i < len; i++) {
    int temp = ar[i];
    foo += temp;
    bar -= temp;
}
is any better than:
for (int i = 0; i < len; i++) {
    foo += ar[i];
    bar -= ar[i];
}
I wouldn't worry about it:
If the code in the body of your loop is going to access the same array entry, say ar[i], multiple times, any halfway decent compiler (at a nonzero optimization level) will keep that value in a register for quick re-use. In other words, the compiler will probably generate the exact same assembly given either of the above code samples.
Note that either of these is still O(1) (accessing one thing one time). Don't confuse big-O notation of algorithms with instruction-level optimizations.
Edit
I just compiled a sample program with two functions, containing the above two samples, and at -O2 gcc 4.7.2 generated the exact same machine code; byte-for-byte.
The only way you can perform better than O(1) time is to not have to do anything in the first place. That would be O(0) time.
Or with fewer words: No.
There are already things built into modern CPU hardware (cache lines for example) that do something like what you describe but better in a way that a temporary variable cannot do. Even better than that, no source modification is needed.
No. Array access is not some magical zero-footprint thing made out of sparkles and love. The algorithm to determine address from array indices in C can be seen here. The more dimensions you have on your array, the slower it gets to access, as additional operations (primarily muls and conditionals, in terms of cost) are required to arrive at the final, 1D memory address. Even if your array has just one dimension, you still have to calculate the offset on the base address, which is a single add operation, hence O(1).
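For concreteness, here is a small sketch of the row-major address arithmetic behind a 2D access (the formula is standard C semantics; the helper's name is my own):
#include <stddef.h>
#include <stdint.h>

/* Row-major address of element [i][j] in an array with `cols` columns of
 * `elem_size`-byte elements: base + (i * cols + j) * elem_size.
 * This fixed bit of arithmetic is all that sits behind each O(1) access. */
void *element_addr_2d(void *base, size_t i, size_t j,
                      size_t cols, size_t elem_size)
{
    return (uint8_t *)base + (i * cols + j) * elem_size;
}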

Performance Optimization for Matrix Rotation

I'm now stuck on a performance optimization lab in the book "Computer Systems: A Programmer's Perspective", described as follows:
In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
------- -------
|1|2|3| |3|6|9|
------- -------
|4|5|6| after rotate is |2|5|8|
------- -------
|7|8|9| |1|4|7|
------- -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))

void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with an idea based on inner-loop unrolling. The results:
Code Version      Speed Up
original          1x
unrolled by 2     1.33x
unrolled by 4     1.33x
unrolled by 8     1.55x
unrolled by 16    1.67x
unrolled by 32    1.61x
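For reference, a sketch of what the inner loop unrolled by 4 looks like (illustrative; the exact unrolled code may differ):
/* Inner loop of naive_rotate unrolled by 4; assumes dim is a multiple of 4
 * (it is a multiple of 32 in the lab). */
void rotate_unroll4(int dim, pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j += 4) {
            dst[RIDX(dim-1-j,     i, dim)] = src[RIDX(i, j,   dim)];
            dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, j+1, dim)];
            dst[RIDX(dim-1-(j+2), i, dim)] = src[RIDX(i, j+2, dim)];
            dst[RIDX(dim-1-(j+3), i, dim)] = src[RIDX(i, j+3, dim)];
        }
}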
I also got a code snippet from pastebin.com that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1;
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while (--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while (--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while (--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat 32 rows as a data zone and perform the rotation on each zone in turn. The speed-up of this version is 1.85x, beating all of the loop-unrolled versions.
Here are the questions:
In the inner-loop-unrolled version, why does the improvement shrink as the unrolling factor increases? In particular, changing the unrolling factor from 8 to 16 does not have the same effect as switching from 4 to 8. Does the result have some relationship with the depth of the CPU pipeline? If the answer is yes, could the drop-off in improvement reflect the pipeline length?
What is the probable reason for the speed-up of the data-zone version? It seems that there is not much essential difference from the original naive version.
EDIT:
My test environment is an Intel Centrino Duo architecture and the version of gcc is 4.4.
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the Centrino Core 2 Duo has two cores, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (i.e., when you run the task manager if you're on Windows, or top if you're on Linux, you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last point, think about how a McDonald's drive-through works (I think that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second car enters while the first is still ordering, then by the time both finish (assuming each operation in the drive-through takes one 'cycle' or time unit), 2 full operations will be done after only 4 cycles have elapsed. If each car did all of its operations at one window, the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles for ordering, paying, and getting food, for a total of 6 cycles. So, operation time decreases due to pipelining.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 unrolls causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess; I can't know for sure without seeing your exact code, as well as the machine code). If all the instructions can't fit into the cache or into the registers, then there is some time necessary to prepare them all to run (i.e., people have to get into their cars and get to the drive-through in the first place). There will be some reduction in speed if they all arrive at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are fewer operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically), whereas counting up means you have to explicitly compare to a max value on each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.

How does cache memory work?

Today in my computer organization class, the teacher talked about something interesting. When it came to discussing why cache memory works, he said that for the following loop:
for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
        X[i][j] = X[i][j] + K; // X is double (8 bytes)
it is not good to swap the first line with the second (that is, to interchange the two loops). What are your opinions on this? And why is it like that?
There is a very good paper by Ulrich Drepper of Red Hat and glibc fame, What Every Programmer Should Know About Memory. One section discussed caches in great detail. For example, there are cache effects in SMP systems where CPUs can end up thrashing ownership of a modified cache line back and forth, greatly harming performance.
Locality of reference. Because the data is stored by rows, for each row the j columns are in adjacent memory addresses. The OS will typically load an entire page from memory into the cache and adjacent address references will likely refer to that same page. If you increment by the row index in the inner loop it is possible that these rows will be on different pages (since they are separated by N doubles each) and the cache may have to constantly bring in and throw away pages of memory as it references the data. This is called thrashing and is bad for performance.
In practice and with larger, modern caches, the sizes of the rows/columns would need to be reasonably large before this would come into play, but it's still good practice.
[EDIT] The answer above is specific to C and may differ for other languages. The only one that I know is different is FORTRAN. FORTRAN stores things in column major order (the above is row major) and it would be correct to change the order of the statements in FORTRAN. If you want/need efficiency, it's important to know how your language implements data storage.
It is like that because caches like locality. The same number of memory accesses, spaced further apart, will hit different "lines" of cache, or might even miss the cache altogether. It is therefore good, whenever you have the choice, to organize data so that accesses that are likely to happen close to each other in time also do so in space. This increases the chance of a cache hit, and gives you more performance.
There is of course a wealth of information about this topic available; see for instance this Wikipedia entry on locality of reference. Or, I guess, your own course textbook. :)
In C, n-dimensional matrices are row major, meaning the last index into the matrix represents adjacent spaces in memory. This is different than some other languages, FORTRAN for example, which are column major. In FORTRAN, it's more efficient to iterate through a 2D matrix like this:
do jj = 1, N
    do ii = 1, M
        x(ii,jj) = x(ii,jj) + K
    enddo
enddo
Cache memory is very fast and very expensive memory that sits close to the CPU. Rather than fetch one small piece of data from the RAM each time, the CPU fetches a chunk of data and stores it in the cache. The bet is that if you just read one byte, then the next byte you read is likely to be right after it. If this is the case, then it can come from the cache.
By laying out your loop as you have it, you read the bytes in the order that they are stored in memory. This means that they are in the cache, and can be read very quickly by the CPU. If you swapped lines 1 and 2 around, then you'd jump N doubles ahead each time around the inner loop. The bytes you are reading would no longer be consecutive in memory, and so they may not be in the cache. The CPU would have to fetch them from the (slower) RAM, and so your performance would decrease.
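For reference, the swapped (cache-unfriendly) ordering the teacher warned about would look like this sketch:
// Interchanged loops: the inner loop now strides N doubles (N * 8 bytes)
// per access, so consecutive accesses land on different cache lines.
for (j = 0; j < N; j++)
    for (i = 0; i < M; i++)
        X[i][j] = X[i][j] + K;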

Resources