Use two loop bodies or one (result identical)? - c

I have long wondered what is more efficient with regards to making better use of CPU caches (which are known to benefit from locality of reference) - two loops each iterating over the same mathematical set of numbers, each with a different body statement (e.g. a call to a function for each element of the set), or having one loop with a body that does the equivalent of two (or more) body statements. We assume identical application state after all the looping.
In my opinion, having two loops would introduce fewer cache misses and evictions because more instructions and data used by the loop fit in the cache. Am I right?
Cost of a f and g call is negligible compared to cost of the loop
f and g use most of the cache each by itself, and so the cache would be spilled when one is called after another (the case with a single-loop version)
Intel Core Duo CPU
C language source code
The GCC compiler, "no extra switches"
I want answers outside the "premature optimization is evil" character, if possible.
An example of the two-loops version that I am advocating for:
int j = 0, k = 0;
for (int i = 0; i < 1000000; i++)
    j += f(i);
for (int i = 0; i < 1000000; i++)
    k += g(i);

To measure is to know.

I can see three variables (even in a seemingly simple chunk of code):
What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in L2 instruction cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: The inverse does not imply "have a single loop", because:
Do f() and g() operate on large amounts of data, according to i? Then, it'd be nice to know if they operate on the same set of data - again you have to consider whether operating on two different sets screws you up via cache misses.
If f() and g() are indeed that primitive as you first state, and I'm assuming both in code size as well as running time and code complexity, cache locality issues won't arise in little chunks of code like this - your biggest concern would be if some other process were scheduled with actual work to do, and invalidated all the caches until it were your process' turn to run.
A final thought: given that such processes like above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline, and let the compiler unroll the loop. That is because for the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line that'd contain i, j, k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
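To make the "measure it" advice concrete, here is a minimal timing sketch in C. The bodies of f and g are hypothetical stand-ins (the real functions are not shown in the question), and run_split and run_fused are names made up for illustration. It uses clock() from <time.h>, which reports CPU time at coarse resolution, so a single number means little; repeat the runs before drawing conclusions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-ins for the question's f and g. */
static int f(int i) { return i % 7; }
static int g(int i) { return i % 11; }

/* Two separate loops. Stores both sums and returns elapsed CPU seconds. */
static double run_split(int n, long *j, long *k)
{
    assert(j != NULL && k != NULL);
    long sj = 0, sk = 0;
    clock_t start = clock();
    for (int i = 0; i < n; i++)
        sj += f(i);
    for (int i = 0; i < n; i++)
        sk += g(i);
    clock_t stop = clock();
    *j = sj;
    *k = sk;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* One fused loop doing the same work. */
static double run_fused(int n, long *j, long *k)
{
    assert(j != NULL && k != NULL);
    long sj = 0, sk = 0;
    clock_t start = clock();
    for (int i = 0; i < n; i++) {
        sj += f(i);
        sk += g(i);
    }
    clock_t stop = clock();
    *j = sj;
    *k = sk;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Call both with the same n (for example 1000000), check that they produce the same j and k, and compare the returned times over several runs at the optimization level you actually build with.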

Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.
On the other hand it completely depends on f and g. If both are sufficiently large that each of their code or cacheable data that they use nearly fills a critical cache then swapping between f and g may completely swamp any single loop benefit.
As you say: it depends.

Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data so that the second time (second loop) you iterate over it some if not all will have to be read again.
If the two loops were joined so that each element/block is fetched for the first operation and then is already in cache for the second operation, then no matter how large the data is relative to the cache most if not all of the second operations will take their data from the cache.
Various things like the nature of the cache, the loop getting evicted by data then being fetched evicting data may cause some misses on the second operation. On a pc with an operating system, lots of evictions will occur with other programs getting time slices. But assuming an ideal world the first operation on index i of the data will fetch it from memory, the second operation will grab it from cache.
Tuning for a cache is difficult at best. I regularly demonstrate this even on an embedded system with no interrupts, a single task, and the same source code: execution time can vary dramatically just by changing compiler optimization options or changing compilers, whether different brands or different versions, e.g. gcc 2.x vs 3.x vs 4.x (gcc does not necessarily produce faster code with newer versions, by the way, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc. Once you get into testing with or without a cache, it gets even more interesting. Add a single nop in your startup code so that your whole program moves one instruction over in memory and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. With the same compiler and the same code you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler).
That doesn't mean you can't tune for a cache; it just means that figuring out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that alone doesn't work anymore: you might get great results on your computer that day with that program and that compiler, but tomorrow on your computer, or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster. Maybe it had nothing to do with your code; your email program may have been downloading a lot of mail in the background during one test and not during the other.
Assuming I understood your question correctly I think the single loop is probably faster in general.

Breaking the loops into smaller chunks is a good idea. It could improve the cache-hit ratio quite a lot and can make a lot of difference to the performance.
From your example:
int j = 0, k = 0;
for (int i = 0; i < 1000000; i++)
    j += f(i);
for (int i = 0; i < 1000000; i++)
    k += g(i);
I would either fuse the two loops into one loop like this:
int j = 0, k = 0;
for (int i = 0; i < 1000000; i++) {
    j += f(i);
    k += g(i);
}
Or, if this is not possible, do the optimization called loop tiling:
#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps */
                       /* the working-set below your first level cache size */
int j = 0, k = 0;
int i = 0;
int elements = 1000000;
do {
    int n = i + TILE_SIZE;
    if (n > elements) n = elements;
    // perform loop A on this tile
    for (int a = i; a < n; a++)
        j += f(a);
    // perform loop B on the same tile
    for (int a = i; a < n; a++)
        k += g(a);
    i = n; // advance to the start of the next tile
} while (i != elements);
The trick with loop tiling is, that if the loops share an access pattern the second loop body has the chance to re-use the data that has already been read into the cache by the first loop body. This won't happen if you execute loop A a million times because the cache is not large enough to hold all this data.
Breaking the loop into smaller chunks and executing them one after another will help here a lot. The trick is to limit the working-set of memory below the size of your first level cache. I aim for half the size of the cache, so other threads that get executed in-between don't mess up my cache so much.

If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.
But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question if it was true and applicable to other machines.

This seems like something the compiler could optimize for you so instead of trying to figure it out yourself and making it fast, use whatever method makes your code more clear and readable. If you really must know, time both methods for input size and calculation type that your application uses (try the code you have now but repeat your calculations many many times and disable optimization).


elegant (and fast!) way to rearrange columns and rows in an ADC buffer

I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.
On an ARM Cortex-M4 processor I am using 3 ADCs to sample analog values, with DMA and the "double buffer" technique. When I get a "half buffer complete" interrupt, the data in the 1D array are arranged like this:
Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn
where Sn stands for the sample number and Chn for the channel number.
Since I do 2x oversampling, n equals 16. The channel count is 9 in reality; in the example above it is 3.
Or written in 2D form:
Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn
where the rows represent the n samples and the columns represent the channels.
I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, multiplication, once I have "sorted out" the channels. This part is pretty fast.
But the code I am using for "reshaping" the 1-D Buffer array to an accumulated value for each channel is pretty poor and slow:
for(i = 0; i < ADC_BUFFER_SZ; i++) {
    for(j = 0; j < MEAS_ADC_CHANNELS; j++) {
        if(i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
        else *(ADC_acc + j) = *(ADC_DMABuffer + bP); // initialize new on first run
        bP++; // assumed: advance the running index into the DMA buffer
    }
}
After this procedure I get a 1D array with one (accumulated) U32 value per channel, but this code is pretty slow: ~4000 clock cycles for 16 samples per channel with 9 channels, or ~27 clock cycles per sample. To achieve higher sample rates, this needs to be many times faster than it is right now.
What I am looking for is some elegant way, using the CMSIS-DSP functions, to achieve the same result as above, but much faster. My gut says that I am thinking in the wrong direction and that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic and I most probably won't be the last. So I'm asking for a little push in the right direction, as I guess this could be a severe case of "work-blindness" ...
I was thinking about using the dot-product function "arm_dot_prod_q31" together with an array filled with ones for the accumulation task, because I could not find a CMSIS function that simply sums up a 1D array. But this would not solve the "reshaping" issue: I would still have to copy data around and create new buffers to prepare the vectors for the "arm_dot_prod_q31" call ...
Besides that it feels somehow awkward using a dot-product, where I just want to sum up array elements …
I also thought about transforming the ADC Buffer into a 16 x 9 or 9 x 16 Matrix, but then I could not find anything where I could easily (=fast & elegant) access rows or columns, which would leave me with another issue to solve, which would eventually require to create new buffers and copying data around, as I am missing a function where I could multiply a matrix with a vector ...
Maybe someone has a hint for me, that points me in the right direction?
Thanks a lot and cheers!
ARM is a RISC device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data as physically close in memory as your system will allow.
I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:
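int raw[ADC_BUFFER_SZ];            /* interleaved samples; ADC_BUFFER_SZ taken here as the total element count */
int sums[MEAS_ADC_CHANNELS] = {0}; /* one accumulator per channel */
for (int idxRaw = 0, idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++) {
    sums[idxSum++] += raw[idxRaw];
    if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;
}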
Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough that you should be able to get it working quickly.
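For reference, a self-contained sketch of the same single-loop idea with explicit types might look like this. The function name accumulate_channels and its parameters are made up for illustration, and it assumes 16-bit raw samples summed into 32-bit accumulators. The accumulators are cleared once up front, so the hot loop needs no "first run" test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sum an interleaved buffer (Ch1S1, Ch2S1, ..., ChNS1, Ch1S2, ...) into
 * one accumulator per channel. All names here are illustrative. */
static void accumulate_channels(const uint16_t *buf, size_t samples,
                                size_t channels, uint32_t *acc)
{
    assert(buf != NULL && acc != NULL && channels > 0);

    for (size_t c = 0; c < channels; c++)
        acc[c] = 0;

    /* Single pass over the raw data; c wraps around after each sample row. */
    size_t total = samples * channels;
    for (size_t i = 0, c = 0; i < total; i++) {
        acc[c] += buf[i];
        if (++c == channels)
            c = 0;
    }
}
```

With the question's numbers this would be called as accumulate_channels(ADC_DMABuffer, 16, 9, ADC_acc), assuming the buffer holds 16-bit samples.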
Writing pointer math in your code will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.
That said, I often use a pointer for iteration:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS] = {0};
int *itRaw = raw;
int *itRawEnd = raw + ADC_BUFFER_SZ;
int *itSums = sums;
int *itSumsEnd = itSums + MEAS_ADC_CHANNELS;
while (itRaw != itRawEnd) {
    *itSums += *itRaw;
    ++itRaw;
    ++itSums;
    if (itSums == itSumsEnd) itSums = sums; // wrap back to the first channel
}
But almost never, when I am working with a mathematician or scientist, which is often the case with measurement/metrological device development. It's easier to explain the array notation to non-C reviewers, than the iterator form.
Also, if I have an algorithm description that uses the phrase "for each...", I tend to prefer the for loop form, but when the description uses "while ...", then of course I will probably use the while... form, unless I can skip one or more variable assignment statements by rearranging it to a do..while. But I often stick as close as possible to the original description until after I've passed all the testing criteria, then do rearrangement of loops for code hygiene purposes. It's easier to argue with a domain expert that their math is wrong, when you can easily convince them that you implemented what they described.
Always get it right first, then measure and make the determination whether to further hone the code. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a wary eye on the machine code they generated, and often developed habits that avoided those worst-case scenarios. That is uncommon today, and almost certainly not the case for your ARM toolchain. But you may have to look into how your compiler's optimization features work and try something different.
Do try to avoid doing value math on the same line as your pointer math. It's just confusing:
*(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
*(p1++) = *(p2++); // reasonable, especially for experienced coders/reviewers.
p1[offset1] += p2[offset2]; // Okay. Doesn't mix math notation with pointer notation.
p1[offset1 + A*B/C] += p2...; // Very bad.
// But...
offset1 += A*B/C; // Especially helpful when stepping in the debugger.
p1[offset1]... ; // Much better.
Hence the iterator form mentioned earlier. It may reduce the lines of code, but does not reduce the complexity and definitely does increase the odds of introducing a bug at some point.
A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely, universal binding rules across languages. Intentions are obvious, even to non-programmers. While the examples above are pretty trivial and most C programmers would have no problems reading any of them, it's when the number of variables involved and the complexity of the math increase that mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it altogether.

Performance of algorithms that loop more/less but with same number of O(1) operations

Whenever I see algorithm optimization, I see lots of talk about reducing loop count. Often times, I see multiple operations being incorporated into one loop that were originally done separately.
Ultimately, the same number of O(1) processes are performed. It's just that one algorithm splits them into multiple iterations. Is there honestly a performance benefit to combining operations, from a scaling perspective?
Overly simplified example. I'm aware this is not a good example because the inner time complexity operations are low compared to the act of even incrementing i, but you get my point.
let tally1 = 0
let tally2 = 0
for (let i = 0; i < 10; i++) {
  tally1 += 1
}
for (let i = 0; i < 10; i++) {
  tally2 += 1
}
// vs
for (let i = 0; i < 10; i++) {
  tally1 += 1
  tally2 += 1
}
It is obvious that the second version will perform better because all the operations that make up the loop only have to be executed once.
So while the operations executed inside the loop will perform no better or worse, the overall execution time will be shorter.
Whether that is relevant or not largely depends on how expensive the operations inside the loop are. If they are cheap, the overhead of the loop will be noticeable, and it may be worth optimizing the code. If they are expensive, it might not be worth the effort.
Besides performance, clarity of the code is also a good thing. So if it doesn't matter from a performance point of view, you should choose the code that is easier to read.
In very short loops, the overhead of the loop construction itself (increment and termination test) is "significant". To the point that compilers may perform "loop unrolling" optimizations, i.e. replicate the loop body to avoid performing the intermediate tests (with some extra care to handle termination).
Loop merging can bring similar speedups.
When the loop bodies are more complicated, the loop overhead becomes more negligible, and performance can even degrade when you merge the loops because you may saturate the number of required registers or degrade cache efficiency.
For ordinary programs, these kinds of micro-optimization are often not worth the effort. They are more relevant in the development of reusable code of general usefulness, such as the BLAS routines.

Loop unrolling & optimization

Given the code:
for (int i = 0; i < n; ++i)
{
    A(i);
    B(i);
    C(i);
}
And the optimized version:
for (int i = 0; i < (n - 2); i+=3)
Something is not clear to me: which is better? I can't see anything that works any faster in the other version. Am I missing something here?
All I see is that each instruction depends on the previous instruction, meaning that I need to wait for the previous instruction to finish in order to start the next one...
In the high-level view of a language, you're not going to see the optimization. The speed enhancement comes from what the compiler does with what you have.
In the first case, it's something like:
In the second it's something like:
You can see in the latter case, the overhead of testing and jumping is only 1 instruction per 3. In the first it's 1 instruction per 1; so it happens a lot more often.
Therefore, if you have invariants you can rely on (an array of mod 3, to use your example) then it is more efficient to unwind loops because the underlying assembly is written more directly.
Loop unrolling is used to reduce the number of jump & branch instructions which could potentially make the loop faster but will increase the size of the binary. Depending on the implementation and platform, either could be faster.
Well, whether this code is "better" or "worse" totally depends on implementations of A, B and C, which values of n you expect, which compiler you are using and which hardware you are running on.
Typically the benefit of loop unrolling is that the overhead of doing the loop (that is, increasing i and comparing it with n) is reduced. In this case, could be reduced by a factor of 3.
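As a rough illustration of that factor of 3, here is a hand-unrolled sum over an array next to the plain loop. This is only a sketch: the function names are hypothetical, the loop body is a stand-in for the question's A, B and C, and the second loop handles the leftover iterations when n is not a multiple of 3 (which the question's `i < (n - 2)` version silently skips).

```c
#include <assert.h>
#include <stddef.h>

/* Plain loop: one increment and one test per element. */
static long sum_rolled(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; ++i)
        total += a[i];
    return total;
}

/* Unrolled by 3: one increment and one test per three elements. */
static long sum_unrolled3(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long total = 0;
    size_t i = 0;

    /* Main loop; "i + 2 < n" avoids the unsigned underflow of "n - 2". */
    for (; i + 2 < n; i += 3) {
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
    }

    /* Remainder loop: the last n % 3 elements. */
    for (; i < n; ++i)
        total += a[i];

    return total;
}
```

An optimizing compiler may well perform this transformation on its own, so it is worth checking the generated code before unrolling by hand.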
As long as the functions A(), B() and C() don't modify the same datasets, the second version provides more parallelization options.
In the first version, the three functions could run simultaneously, assuming no interdependencies. In the second version, all three functions could be run with all three datasets at the same time, assuming you had enough execution units to do so and again, no interdependencies.
Generally it's not a good idea to try to "invent" optimizations unless you have hard evidence that you will gain an increase, because many times you may end up introducing a degradation. Typically the best way to obtain such evidence is with a good profiler. I would test both versions of this code with a profiler to see the difference.
Also, many times loop unrolling isn't very portable; as mentioned previously, it depends greatly on the platform, compiler, etc.
You can additionally play with the compiler options. An interesting gcc option is "-floop-optimize", that you get automatically with "-O, -O2, -O3, and -Os"
EDIT Additionally, look at the "-funroll-loops" compiler option.

Array access/write performance differences?

This is probably going to be language dependent, but in general, what is the performance difference between accessing and writing to an array?
For example, suppose I am writing a prime sieve and representing the primes as a boolean array.
Upon finding a prime, I can say
for(int i = 2; n * i < end; i++)
    prime[n * i] = false;
or
for(int i = 2; n * i < end; i++)
    if(prime[n * i])
        prime[n * i] = false;
The intent in the latter case is to check the value before writing it to avoid having to rewrite many values that have already been checked. Is there any realistic gain in performance here, or are access and write mostly equivalent in speed?
Impossible to answer such a generic question without the specifics of the machine/OS this is running on, but in general the latter is going to be slower because:
In the second example you have to get the value from RAM into the L2/L1 cache and read it into a register, make a change to the value and write it back. In the first case you might very well get away with simply writing a value to the L1/L2 caches. It can be written to RAM from the caches later while your program is doing something else.
The second form has much more code to execute per iteration. For large enough number of iterations, the difference gets big real fast.
In general this depends much more on the machine than on the programming language. The writes often will take a few more clock cycles because, depending on the machine, more cache values need to be updated in memory.
However, your second segment of code will be WAY slower, and it's not just because there's "more code". The big reason is that any time you use an if-statement, on most machines the CPU uses a branch predictor. The CPU literally predicts which way the if-statement will go ahead of time, and if it's wrong it has to backtrack.
If you want to do some optimization, I would recommend the following:
Profile! See what's really taking up time.
Multiplication is much harder than addition. Try rewriting the loop so that i += n, and use this for your array index.
The loop condition "should" be totally reevaluated at every iteration unless the compiler optimizes it away. So try avoiding multiplication in there.
Use -O2 or -O3 as a compiler option
You might find that some values of n are faster than others because of cache locality. You might think of some clever ways to rewrite your code to take advantage of this.
Disassemble the code and look at what it's actually doing on your processor
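The "step the index by n instead of multiplying" suggestion could look roughly like this in a complete sieve. It is a sketch under assumptions: the function name sieve and its signature are made up, the store is unconditional as in the question's first variant, and the outer bound n * n < end is the usual shortcut (every composite below end has a prime factor whose square is below end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Fills prime[0..end-1] and returns the number of primes below end. */
static size_t sieve(bool *prime, size_t end)
{
    assert(prime != NULL);

    for (size_t k = 0; k < end; k++)
        prime[k] = (k >= 2);

    for (size_t n = 2; n * n < end; n++) {
        if (!prime[n])
            continue;
        /* m walks 2n, 3n, 4n, ... by addition; no n * i per iteration. */
        for (size_t m = 2 * n; m < end; m += n)
            prime[m] = false;
    }

    size_t count = 0;
    for (size_t k = 0; k < end; k++)
        count += prime[k];
    return count;
}
```

Whether the multiplication was actually the bottleneck is something only a profile or the disassembly will tell you; many compilers do this strength reduction themselves.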
It's a hard question and it heavily depends on your hardware, OS and compiler. But for the sake of theory, you should consider two things: branching and memory access. As branching is generally evil, you want to avoid it. I wouldn't even be surprised if some compiler optimization took place and your second snippet were reduced to the first one (compilers love avoiding branches, they probably consider it a hobby, but they have a reason). So in these terms the first example is much cleaner and easier to deal with.
There are also CPU caches and other memory-related issues. I believe that in both examples you have to actually load the memory into the CPU cache, so you can either read it or update it. While reading is not a problem, writing has to propagate the changes up. I wouldn't be worried if you use the function in a single thread (as #gby pointed out, the OS can push the changes a little bit later).
There is only one scenario I can come up with that would make me consider the solution from your second example: if I shared the table between threads to work on it in parallel (without locking) and the CPUs had separate caches. Then, every time you amend a cache line from one thread, the other thread has to update its copy before reading or writing to the same memory block. This is known as cache coherence, and it may actually hurt your performance badly; in such a case I could consider conditional writes. But wait, it's probably far away from your question...

Performance Optimization for Matrix Rotation

I'm now stuck on a performance optimization lab from the book "Computer Systems: A Programmer's Perspective", described as follows:
For an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
------- -------
|1|2|3| |3|6|9|
------- -------
|4|5|6| after rotate is |2|5|8|
------- -------
|7|8|9| |1|4|7|
------- -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with the idea of unrolling the inner loop. The results are:
Code Version     Speed Up
original         1x
unrolled by 2    1.33x
unrolled by 4    1.33x
unrolled by 8    1.55x
unrolled by 16   1.67x
unrolled by 32   1.61x
I also found a code snippet that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
    int stride = 32;
    int count = dim >> 5;
    src += dim - 1;
    int a1 = count;
    do {
        int a2 = dim;
        do {
            int a3 = stride;
            do {
                *dst++ = *src;
                src += dim;
            } while(--a3);
            src -= dim * stride + 1;
            dst += dim - stride;
        } while(--a2);
        src += dim * (stride + 1);
        dst -= dim * dim - stride;
    } while(--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat each group of 32 rows as a data zone and perform the rotation on each zone separately. The speedup of this version is 1.85x, beating all the loop-unrolled versions.
Here are the questions:
In the inner-loop-unroll version, why does the gain in speedup shrink as the unrolling factor increases? In particular, changing the unrolling factor from 8 to 16 does not have the same effect as switching from 4 to 8. Does the result have some relationship with the depth of the CPU pipeline? If so, could the diminishing gain reflect the pipeline length?
What is the probable reason for the speedup of the data-zone version? It seems to have no essential difference from the original naive version.
My test environment is an Intel Centrino Duo architecture and the version of gcc is 4.4.
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the centrino core2 duo has two processors, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (ie, when you run the task manager (if you're on windows, top if you're on linux), you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get a faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last, think about how a McDonald's drive-through works (I think that that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second driver enters while the first is still ordering, then (assuming each operation in the drive-through takes one 'cycle' or time unit) both cars will be done by the time 4 cycles have elapsed. If each car did all of its operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles, for a total of 6 cycles. So, operation time decreases due to pipelining.
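The arithmetic behind the analogy can be written down as two tiny functions. This is an idealized model (one cycle per stage, no stalls), and the function names are made up for illustration: without pipelining, the jobs run back to back; with it, the first job takes a full pass and each later job finishes one cycle after the previous one.

```c
#include <assert.h>

/* Every job occupies all stages before the next job may start. */
static unsigned serial_cycles(unsigned stages, unsigned jobs)
{
    return stages * jobs;
}

/* Ideal pipeline: a new job enters each cycle once the first has started. */
static unsigned pipelined_cycles(unsigned stages, unsigned jobs)
{
    assert(stages > 0);
    return jobs == 0 ? 0 : stages + (jobs - 1);
}
```

For the two cars and three windows above, serial_cycles(3, 2) is 6 and pipelined_cycles(3, 2) is 4.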
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess, can't know for sure without seeing your code exactly, as well as the machine code). If all the instructions can't fit into cache or into the registers, then there is some time necessary to prepare them all to run (ie, people have to get into their cars and get to the drive through in the first place). There will be some reduction in speed if they all get there all at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are fewer operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically), whereas when counting up you have to explicitly compare to a max value on each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 matrix even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.
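The same blocking idea can be written with plain indices, which makes the access pattern easier to see than in the pointer-walking version. This is a sketch under assumptions: pixel is typedef'd to int as a stand-in for the lab's real pixel type, BLOCK and blocked_rotate are names introduced here, and dim is assumed to be a multiple of 32 as the lab states. A copy of the naive version is included only so the two can be compared.

```c
#include <assert.h>
#include <stddef.h>

typedef int pixel;                 /* stand-in for the lab's pixel type */

#define RIDX(i, j, n) ((i) * (n) + (j))
#define BLOCK 32                   /* assumed band height; dim % BLOCK == 0 */

/* Reference version: walks src row by row, dst column by column. */
static void naive_rotate(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++)
        for (int j = 0; j < dim; j++)
            dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}

/* Blocked version: handles one band of BLOCK source rows at a time. */
static void blocked_rotate(int dim, const pixel *src, pixel *dst)
{
    assert(src != NULL && dst != NULL);
    assert(dim > 0 && dim % BLOCK == 0);

    for (int ib = 0; ib < dim; ib += BLOCK)        /* band of source rows */
        for (int j = dim - 1; j >= 0; j--)         /* one source column   */
            for (int i = ib; i < ib + BLOCK; i++)  /* down the band       */
                /* dst is written at consecutive addresses in this loop;
                 * src reads stay inside the current BLOCK-row band. */
                dst[RIDX(dim - 1 - j, i, dim)] = src[RIDX(i, j, dim)];
}
```

Both functions produce the same dst; only the order in which memory is touched differs, which is where any speedup would have to come from.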
