Optimization via loop blocking in C

I'm currently studying C optimizations and had a past assignment optimizing a piece of code. Among other optimizations (unrolling loops and strength reduction) I used blocking according to cache size (following Intel's tutorial on the matter):
https://software.intel.com/en-us/articles/how-to-use-loop-blocking-to-optimize-memory-use-on-32-bit-intel-architecture.
Now I think I understand why this technique works in that case: when the stride is one, it loads blocks into the cache and reduces the number of misses when accessing the next place in memory. But in my code dst[dim * jj + ii] seems to jump around all over the place, since it is being multiplied by jj in the innermost loop. How does the cache account for this? dim is multiplied by 0, then 1, then 2, and at some point it will surpass what a block can hold, making the optimization pointless. Am I understanding this right?
In practice, however, when I used blocking only for the jj variable I didn't get the speedup in performance that I did from using blocking on both ii and jj. So I made it faster, but I don't know why. The assignment is past now, but I still don't understand and it's quite frustrating.
Thank you in advance for bearing with what may be a very stupid question.
void transpose(int *dst, int *src, int dim)
{
    int i, j, dimi, jj, ii;
    dimi = 0;
    for (i = 0; i < dim; i += block_size)
    {
        for (j = 0; j < dim; j += block_size)
        {
            for (ii = i; ii < i + block_size; ii++)
            {
                dimi = dim * ii;
                for (jj = j; jj < j + block_size; jj++)
                {
                    dst[dim * jj + ii] = src[dimi + jj];
                }
            }
        }
    }
}

You have poor spatial locality in dst, but with blocking for both dimensions there's still enough locality in time and space combined that cache lines are typically still hot in L1d cache when you store the next int.
Let's say that dst[dim*jj + ii] is the first int in a cache line. The store to dst[dim*jj + ii + 1] will be in the same cache line. If that line is still hot in L1d cache, the CPU hasn't spent any bandwidth on evicting the dirty line to L2 and then bringing it back into L1d for the next store.
With blocking for both dimensions, that next store will happen after block_size more stores to dst[ dim*(jj+1..block_size-1) + ii ]. (Next iteration of the ii loop.)
If dim and block_size are both powers of 2, the line will probably be evicted because of conflicts. Addresses 4 kiB apart go to the same set in L1d, although the problematic stride is larger for L2. (Intel's L1d caches are 32 kiB and 8-way set associative, so as few as 8 more stores to the same set will probably evict a line. But L3 cache uses a hash function for set indexing, instead of a simple modulo using a range of address bits directly. IDK how big your buffers are; maybe your whole matrix can stay hot in your L3 cache.)
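To make the power-of-2 aliasing concrete, here's a quick sketch that prints which L1d set each successive dst store maps to, assuming a 32 KiB, 8-way, 64-byte-line L1d (64 sets) and a hypothetical dim of 1024:

#include <stdio.h>

/* Rough sketch of L1d set aliasing, assuming 32 KiB / (8 ways * 64-byte lines)
 * = 64 sets, so set = (addr / 64) % 64. With a hypothetical dim = 1024 (a
 * power of 2), consecutive jj stores to dst are dim * sizeof(int) = 4096
 * bytes apart and all land in the same set. */
int main(void)
{
    const int dim = 1024;    /* hypothetical power-of-2 dimension */
    const int ii = 0;
    for (int jj = 0; jj < 12; jj++) {
        unsigned long addr = (unsigned long)(dim * jj + ii) * sizeof(int);
        printf("jj=%2d  addr=%6lu  L1d set=%lu\n", jj, addr, (addr / 64) % 64);
    }
    return 0;
}

Every one of those stores lands in set 0, so only the 8 ways of that single set are available for the whole column of the block.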
But if either dim or block_size aren't a power of 2, then all 64 sets of 8 lines of 64 bytes (L1d) come into play. So up to 64*8 = 512 dirty lines could be in L1d cache. But remember there's still the data being loaded sequentially, and that will take some space. (Not much, because you're reading 16 ints consecutively from each line of loaded data, and using that to dirty 16 different lines of destination data.)
With blocking only in 1 dimension, you're doing many more stores before you come back to a destination line, so it will probably have been evicted to L2 or maybe L3 by then.
BTW, I put your code on the Godbolt compiler explorer (https://godbolt.org/g/g24ehr), and gcc -O3 for x86 doesn't try to do anything special. It uses a vector load into an XMM register, and unpacks with shuffles and does 4 separate int stores.
clang6.0 does something interesting, involving copying a block of 256 bytes. IDK if it's doing this to work around aliasing (because without int *restrict dst it doesn't know that src and dst don't overlap).
BTW, contiguous writes and scattered reads would probably be better. (i.e. invert your inner two loops, so ii changes in the inner-most loop instead of jj). Evicting a dirty cache line is more expensive than evicting a clean line and just re-reading it again later.
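A sketch of that swap (with restrict added so the compiler knows src and dst don't overlap). The block_size value here is just a placeholder, and, like the original, this assumes dim is a multiple of block_size:

#define block_size 32    /* placeholder; tune to your cache */

/* Inner loops swapped: writes to dst are now stride-1 (contiguous), reads
 * from src are the scattered ones. */
void transpose_swapped(int *restrict dst, int *restrict src, int dim)
{
    for (int i = 0; i < dim; i += block_size)
        for (int j = 0; j < dim; j += block_size)
            for (int jj = j; jj < j + block_size; jj++)
            {
                int dimjj = dim * jj;
                for (int ii = i; ii < i + block_size; ii++)
                    dst[dimjj + ii] = src[dim * ii + jj];  /* stride-1 writes */
            }
}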

Related

Why does copying a 2D array column by column take longer than row by row in C? [duplicate]

#include <stdio.h>
#include <time.h>

#define N 32768

char a[N][N];
char b[N][N];

int main() {
    int i, j;

    printf("address of a[%d][%d] = %p\n", N, N, &a[N][N]);
    printf("address of b[%5d][%5d] = %p\n", 0, 0, &b[0][0]);

    clock_t start = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = b[i][j];
    clock_t end = clock();
    float seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    start = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = b[i][j];
    end = clock();
    seconds = (float)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f secs\n", seconds);

    return 0;
}
Output:
address of a[32768][32768] = 0x80609080
address of b[ 0][ 0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs
Why does column-by-column copying take almost 6 times as long as row-by-row copying? I understand that a 2D array is basically an n×n array where A[i][j] = A[i*n + j], but using simple algebra I calculated that a Turing machine head (on main memory) would have to travel roughly the same distance in both cases. Here n×n is the size of the array and x is the distance between the last element of the first array and the first element of the second array.
It pretty much comes down to this:
When accessing data, your CPU will not only load a single value, but will also load adjacent data into the CPU's L1 cache. When iterating through your array by row, the items that have automatically been loaded into the cache are actually the ones that are processed next. However, when you are iterating by column, each time an entire "cache line" of data (the size varies per CPU) is loaded, only a single item is used and then the next line has to be loaded, effectively making the cache pointless.
The wikipedia entry and, as a high level overview, this PDF should help you understand how CPU caches work.
Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only very few of your columns fit into the L1 cache at the same time. If your rows were much smaller (say, the total size of your two dimensional array was only some kilobytes) then you might not see a performance impact from iterating per-column.
While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: 0 to one minus the number of bytes available (on nearly all machines).
Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with pre-fetch strategies) for sequential access in linear order of addresses; e.g. 100,101,102...
In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages like FORTRAN and Common Lisp concatenate columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.
If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100... for the case N = 10000 bytes. Then the second column is 101, 10101, 20101, etc. This is the absolute worst case for most cache schemes.
In the very worst case, you can cause a page fault on each access. These days, even on an average machine, it would take an enormous array to cause that. But if it happened, each touch could cost ~10 ms for a head seek, whereas sequential access costs a few nanoseconds per element. That's over a factor of a million difference. Computation effectively stops in this case. It has a name: disk thrashing.
In the more normal case where only cache misses are involved, not page faults, you might see a factor of a hundred. Still worth paying attention.
There are 3 main aspects that contribute to the timing difference:
The first double loop accesses both arrays for the first time. You are actually reading uninitialized memory, which is bad if you expect any meaningful results (functionally as well as timing-wise), but in terms of timing what plays a part here is the fact that these addresses are cold and reside in main memory (if you're lucky), or aren't even paged in (if you're less lucky). In the latter case, you would have a page fault on each new page and would invoke a system call to allocate a page for the first time. Note that this doesn't have anything to do with the order of traversal; it's simply because the first access is much slower. To avoid that, initialize both arrays to some value.
Cache line locality (as explained in the other answers) - if you access sequential data, you miss once per line, and then enjoy the benefit of having it fetched already. You most likely won't even hit it in the cache but rather in some buffer, since the consecutive requests will be waiting for that line to get fetched. When accessing column-wise, you would fetch the line, cache it, but if the reuse distance is large enough - you would lose it and have to fetch it again.
Prefetching - modern CPUs have HW prefetching mechanisms that can detect sequential accesses and prefetch the data ahead of time, which will eliminate even the first miss of each line. Most CPUs also have stride-based prefetchers which may be able to cover the column stride, but these usually don't work well with matrix structures, since you have too many columns and it would be impossible for the HW to track all these stride flows simultaneously.
As a side note, I would recommend that any timing measurement be performed multiple times and the results amortized; that would also eliminate problem #1 (see the sketch below).
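A minimal sketch of point #1 plus the side note, using a smaller N (4096, an assumption so it finishes quickly) and a hypothetical time_colmajor() helper: initialize both arrays up front, then time the same traversal several times and average.

#include <stdio.h>
#include <string.h>
#include <time.h>

#define N    4096
#define RUNS 5

static char a[N][N];
static char b[N][N];

/* Hypothetical helper: time one column-major copy pass. */
static double time_colmajor(void)
{
    clock_t start = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j];
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    /* Point #1: touch all the pages before timing anything. */
    memset(a, 1, sizeof a);
    memset(b, 2, sizeof b);

    double total = 0.0;
    for (int r = 0; r < RUNS; r++) {
        double t = time_colmajor();
        printf("run %d: %f s\n", r, t);
        total += t;
    }
    printf("average: %f s\n", total / RUNS);
    return 0;
}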

Understanding how to write cache-friendly code

I have been trying to understand how to write cache-friendly code. As a first step, I was trying to understand the performance difference between row-major and column-major array access.
So I created an int array of size 512×512, so that the total size is 1 MB. My L1 cache is 32 KB, my L2 cache is 256 KB, and my L3 cache is 3 MB. So my array fits in the L3 cache.
I simply calculated the sum of the array elements in row-major order and in column-major order and compared their speed. Every time, column-major order is slightly faster. I expected row-major order to be faster (maybe several times faster).
I thought the problem might be due to the small size of the array, so I made another array of size 8192×8192 (256 MB). Still the same result.
Below is the code snippet I used:
#include "time.h"
#include <stdio.h>
#define S 512
#define M S
#define N S
int main() {
// Summing in the row major order
int x = 0;
int iter = 25000;
int i, j;
int k[M][N];
int sum = 0;
clock_t start, end;
start = clock();
while(x < iter) {
for (i = 0; i < M; i++) {
for(j = 0; j < N; j++) {
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
// Summing in the column major order
x = 0;
sum = 0;
int h[M][N];
start = clock();
while(x < iter) {
for (j = 0; j < N; j++) {
for(i = 0; i < M; i++){
sum += k[i][j];
}
}
x++;
}
end = clock();
printf("%i\n", end-start);
}
Question : can some one tell me what is my mistake and why I am getting this result?
I don't really know why you get this behaviour, but let me clarify some things.
There are at least 2 things to consider when thinking about cache: cache size and cache line size. For instance, my Intel i7 920 processor has a 256 KB L2 cache with a 64-byte line size. If your data fits inside the cache, then it really doesn't matter in which order you access it.
All the problems of optimizing code to be cache friendly come down to 2 things. The first: if possible, split the memory accesses into blocks in such a way that a block fits in the cache. Do all the computations possible with that block, then bring in the next block, do the computations with it, and so on. The other thing (the one you are trying to do) is to access memory in a consecutive way. When you request data from memory (let's say an int - 4 bytes), a whole cache line is brought into the cache (in my case 64 bytes: that is, 16 adjacent integers, including the one you requested, are brought into the cache). Here is where row order vs. column order comes into play. With row order you have 1 cache miss for every 16 memory requests; with column order you get a cache miss for every request (but only if your data doesn't fit in the cache; if your data fits in the cache, then you get the same ratio as with row order, because you still have the lines in the cache from way back when you requested the first element in the line; of course, associativity can come into play and a cache line can be evicted even if not all of the cache is filled with your data).
Regarding your problem: when the data fits in the cache, as I said, the access order doesn't matter that much, but when you do the second summing, the data is already in the cache from when you did the first sum, and that's why it is faster. If you do the column-order sum first, you should see that the row-order sum becomes faster simply because it is done after. However, when the data is large enough, you shouldn't get the same behaviour. Try the following: between the two sums, do something with other large data in order to evict the whole cache (see the sketch below).
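A minimal sketch of that last suggestion, assuming a hypothetical 32 MiB scratch buffer (pick anything comfortably larger than your L3):

#include <stddef.h>

/* Assumed scratch size: 32 MiB, i.e. bigger than a typical L3. Adjust to taste. */
#define SCRATCH_BYTES (32u * 1024u * 1024u)

static volatile unsigned char scratch[SCRATCH_BYTES];

/* Touch one byte per 64-byte line so every line of the scratch buffer is
 * brought in, evicting whatever was cached before. */
static void thrash_cache(void)
{
    unsigned sum = 0;
    for (size_t i = 0; i < SCRATCH_BYTES; i += 64)
        sum += scratch[i];
    (void)sum;
}

Calling thrash_cache() between the two timed sums should remove the warm-cache advantage of whichever sum runs second.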
Edit
I see a 3-4x speedup for row major (although I expected >8x speedup. any idea why?). [..] it would be great if you could tell me why speedup is only 3x
It's not that accessing the matrix the "right way" doesn't improve much; it's more that accessing the matrix the "wrong way" doesn't hurt that much, if that makes any sense.
Although I can't provide you with a specific and exact answer, what I can tell you is that modern processors have very complicated and extremely efficient cache models. They are so powerful that, for instance, in many common cases they can mask the cache levels, making it seem as if, instead of a 3-level cache, you have one big cache (you don't see a penalty when increasing your data size from a size that fits in L2 to a size that fits only in L3). Running your code on an older processor (let's say 10 years old), you would probably see the speedup you expect. Modern processors, however, have mechanisms that help a lot with cache misses. Desktop processors are designed with the philosophy of running "bad code" fast, so a lot of investment is made in improving "bad code" performance, because the vast majority of desktop applications aren't written by people who understand branching issues or cache models. This is opposed to the high-performance market, where specialized processors make bad code hurt very much because they implement weak mechanisms that deal with "bad code" (or don't implement them at all). These mechanisms take up a lot of transistors and so increase the power consumption and the heat generated, but they are worth implementing in a desktop processor where most of the code is "bad code".

C cache optimization for direct mapped cache

Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024 Byte direct-mapped cache with block sizes of 16 bytes. So that makes 64 lines (sets in this case) then. Assume the cache starts empty. Consider the following code:
struct pos {
    int x;
    int y;
};

struct pos grid[16][16];
int total_x = 0;
int total_y = 0;

void function1() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[j][i].x;
            total_y += grid[j][i].y;
        }
    }
}

void function2() {
    int i, j;
    for (i = 0; i < 16; i++) {
        for (j = 0; j < 16; j++) {
            total_x += grid[i][j].x;
            total_y += grid[i][j].y;
        }
    }
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every struct pos has a size of 8 bytes, thus the total size of pos[16][16] is 2048 bytes. And the order of the array in memory is as follows:
pos[0][0] pos[0][1] pos[0][2] ...... pos[0][15] pos[1][0] ...... pos[1][15] ...... pos[15][0] ...... pos[15][15]
The cache organization compared to the data
For the cache, each block is 16 bytes, which is the same size as two elements of the array. The entire cache is 1024 bytes, which is half the size of the entire array. Since the cache is direct-mapped, if we label the cache blocks from 0 to 63, we can safely assume that the mapping looks like this:
------------ memory----------------------------cache
pos[0][0] pos[0][1] -----------> block 0
pos[0][2] pos[0][3] -----------> block 1
pos[0][4] pos[0][5] -----------> block 2
.......
pos[0][14] pos[0][15] --------> block 7
pos[1][0] pos[1][1] -----------> block 8
pos[1][2] pos[1][3] -----------> block 9
.......
pos[7][14] pos[7][15] --------> block 63
pos[8][0] pos[8][1] -----------> block 0
.......
pos[15][14] pos[15][15] -----> block 63
How function1 manipulates memory
The loop has a column-wise inner loop: the first iteration loads pos[0][0] and pos[0][1] into cache block 0, the second iteration loads pos[1][0] and pos[1][1] into cache block 8. The cache is cold, so in the first column x always misses while y always hits. The second column's data would supposedly all be in the cache from the first column's accesses, but this is NOT the case: the access to pos[8][0] has already evicted the earlier pos[0][0] line (they both map to block 0!). And so on; the miss rate is 50%.
How function2 manipulates memory
The second function has a nice stride-1 access pattern. That means when accessing pos[0][0].x pos[0][0].y pos[0][1].x pos[0][1].y, only the first one is a miss, due to the cold cache. The following patterns are all the same. So the miss rate is only 25%.
A k-way associative cache follows the same analysis, although it may be more tedious. To get the most out of the cache system, try to set up a nice access pattern, say stride-1, and use the data as much as possible during each load from memory. Real-world CPU microarchitectures employ other intelligent designs and algorithms to enhance efficiency. The best method is always to measure the time in the real world, dump the core code, and do a thorough analysis.
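If you want to sanity-check the 50% / 25% figures, here is a rough simulation sketch (an illustration, not part of the original answers): it models the 1024-byte direct-mapped cache with 16-byte lines and replays both access patterns, assuming grid starts at a line-aligned address.

#include <stdio.h>

#define LINE_SIZE 16
#define NUM_SETS  64    /* 1024-byte direct-mapped cache / 16-byte lines */

static long tags[NUM_SETS];
static long hits, misses;

static void cache_reset(void)
{
    for (int s = 0; s < NUM_SETS; s++) tags[s] = -1;  /* -1 = empty set */
    hits = misses = 0;
}

static void cache_access(unsigned long addr)
{
    unsigned long line = addr / LINE_SIZE;
    unsigned long set  = line % NUM_SETS;
    long tag = (long)(line / NUM_SETS);
    if (tags[set] == tag) hits++;
    else { misses++; tags[set] = tag; }
}

/* grid[16][16] of 8-byte structs, assumed to start at line-aligned address 0:
 * &grid[row][col].x == (row * 16 + col) * 8, and .y sits 4 bytes after .x. */
static void element(int row, int col)
{
    unsigned long base = (unsigned long)(row * 16 + col) * 8;
    cache_access(base);      /* .x */
    cache_access(base + 4);  /* .y */
}

int main(void)
{
    cache_reset();
    for (int i = 0; i < 16; i++)      /* function1: grid[j][i] */
        for (int j = 0; j < 16; j++)
            element(j, i);
    printf("function1 miss rate: %.2f\n", (double)misses / (hits + misses));

    cache_reset();
    for (int i = 0; i < 16; i++)      /* function2: grid[i][j] */
        for (int j = 0; j < 16; j++)
            element(i, j);
    printf("function2 miss rate: %.2f\n", (double)misses / (hits + misses));
    return 0;
}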
Ok, my computer science lectures are a bit far off but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 bytes long (2 x 4). Since your cache blocks are 16 bytes, a memory access to grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index, only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that has been fetched; whether you do depends on the number of fetches in the inner loop vs. the overall cache size, though.
Now we have to think about the cache size as well: you say that you have 64 lines that are directly mapped. In function 1, an inner loop is 16 fetches. That means on the 17th fetch you get to grid[0][i+1]. This should actually be a hit, since it should have been kept in the cache since the last inner-loop walk. Every second inner loop should therefore consist only of hits.
Well, if my reasoning is correct, the answer that has been given to you would be wrong: both functions should perform with 25% misses. Maybe someone will find a better answer, but if you understand my reasoning I'd ask a TA about it.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
struct pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."

Which ordering of nested loops for iterating over a 2D array is more efficient [duplicate]

Which of the following orderings of nested loops to iterate over a 2D array is more efficient in terms of time (cache performance)? Why?
int a[100][100];

for (i = 0; i < 100; i++)
{
    for (j = 0; j < 100; j++)
    {
        a[i][j] = 10;
    }
}

or

for (i = 0; i < 100; i++)
{
    for (j = 0; j < 100; j++)
    {
        a[j][i] = 10;
    }
}
The first method is slightly better, as the cells being assigned to lie next to each other.
First method:
[ ][ ][ ][ ][ ] ....
^1st assignment
^2nd assignment
[ ][ ][ ][ ][ ] ....
^101st assignment
Second method:
[ ][ ][ ][ ][ ] ....
^1st assignment
^101st assignment
[ ][ ][ ][ ][ ] ....
^2nd assignment
For array[100][100] - they are both the same, if the L1 cache is larger than 100*100*sizeof(int) == 10000*sizeof(int) == [usually] 40000. Note that on Sandy Bridge, 100*100 integers should be enough elements to see a difference, since the L1 cache is only 32k.
Compilers will probably optimize this code all the same
Assuming no compiler optimizations, and that the matrix does not fit in the L1 cache - the first code is better due to cache performance [usually]. Every time an element is not found in cache - you get a cache miss - and need to go to the RAM or L2 cache [which are much slower]. Taking elements from RAM to cache [cache fill] is done in blocks [usually 8/16 bytes] - so in the first code, you get at most a miss rate of 1/4 [assuming 16-byte cache blocks, 4-byte ints], while in the second code it is unbounded, and can even be 1. In the second code snippet, elements that were already in cache [inserted in the cache fill for the adjacent elements] are taken out, and you get a redundant cache miss.
This is closely related to the principle of locality, which is the general assumption used when implementing the cache system. The first code follows this principle while the second doesn't - so cache performance of the first will be better of those of the second.
Conclusion:
For all cache implementations I am aware of - the first will not be worse than the second. They might be the same - if there is no cache at all, or if the whole array fits in the cache completely - or due to compiler optimization.
This sort of micro-optimization is platform-dependent so you'll need to profile the code in order to be able to draw a reasonable conclusion.
In your second snippet, the change in j in each iteration produces a pattern with low spatial locality. Remember that behind the scenes, an array reference computes:
( ((y) * (row->width)) + (x) )
Consider a simplified L1 cache that has enough space for only 50 rows of our array. For the first 50 iterations, you will pay the unavoidable cost for 50 cache misses, but then what happens? For each iteration from 50 to 99, you will still cache miss and have to fetch from L2 (and/or RAM, etc). Then, x changes to 1 and y starts over, leading to another cache miss because the first row of your array has been evicted from the cache, and so forth.
The first snippet does not have this problem. It accesses the array in row-major order, which achieves better locality - you only have to pay for cache misses at most once (if a row of your array is not present in the cache at the time the loop starts) per row.
That being said, this is a very architecture-dependent question, so you would have to take into consideration the specifics (L1 cache size, cache line size, etc.) to draw a conclusion. You should also measure both ways and keep track of hardware events to have concrete data to draw conclusions from.
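For the "specifics" mentioned above, here is a small sketch of how you might query the cache geometry at run time. It assumes Linux/glibc: the _SC_LEVEL1_DCACHE_* names are a glibc extension and may not exist on other systems.

#include <stdio.h>
#include <unistd.h>

/* Sketch: query L1d geometry via sysconf (glibc extension; an assumption). */
int main(void)
{
    printf("L1d size:      %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L1d line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L1d assoc:     %ld-way\n",   sysconf(_SC_LEVEL1_DCACHE_ASSOC));
    return 0;
}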
Considering that C++ is row-major, I believe the first method is going to be a bit faster. In memory, a 2D array is represented as a single-dimensional array, and performance depends on whether you access it in row-major or column-major order.
This is a classic problem about cache line bouncing.
Most of the time the first one is better, but I think the exact answer is: IT DEPENDS; a different architecture may give a different result.
In the second method you get cache misses, because the cache stores contiguous data.
Hence the first method is more efficient than the second method.
In your case (fill all array 1 value), that will be faster:
for (j = 0; j < 100 * 100; j++) {
    a[j] = 10;
}
and you could still treat a as 2 dimensional array.
EDIT:
As Binyamin Sharet mentioned, you could do it if your a is declared that way:
int **a = new int*[100];
for (int i = 0; i < 100; i++) {
    a[i] = new int[100];
}
In general, better locality (noted by most responders) is only the first advantage for loop #1's performance.
The second (but related) advantage is that for loops like #1, the compiler is normally capable of efficiently auto-vectorizing the code with a stride-1 memory access pattern (stride-1 means there is contiguous access to array elements one by one in every iteration).
On the contrary, for loops like #2, auto-vectorization will not normally work well, because there is no consecutive stride-1 iterative access to contiguous blocks in memory.
Well, my answer is general. For very simple loops exactly like #1 or #2, there could be even simpler aggressive compiler optimizations used (erasing any difference), and the compiler will normally also be able to auto-vectorize #2 with stride-1 for the outer loop (especially with #pragma simd or similar).
The first option is better, as we can store a[i] in a temp variable inside the first loop and then look up the j index within it; in this sense it acts as a cached variable (see the sketch below).
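A tiny sketch of that idea, using the int a[100][100] from the question:

/* Hoist the row out of the inner loop so only j varies. */
void fill(int a[100][100])
{
    for (int i = 0; i < 100; i++) {
        int *row = a[i];        /* the "cached" row pointer */
        for (int j = 0; j < 100; j++)
            row[j] = 10;
    }
}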

algorithm comparison in C, what's the difference?

#define IMGX 8192
#define IMGY 8192

int red_freq[256];
char img[IMGY][IMGX][3];

int main() {
    int i, j;
    long long total;
    long long redness;

    for (i = 0; i < 256; i++)
        red_freq[i] = 0;

    for (i = 0; i < IMGY; i++)
        for (j = 0; j < IMGX; j++)
            red_freq[img[i][j][0]] += 1;

    total = 0;
    for (i = 0; i < 256; i++)
        total += (long long)i * (long long)red_freq[i];

    redness = (total + (IMGX*IMGY/2)) / (IMGX*IMGY);
}
What's the difference when you replace the second for loop with

for (j = 0; j < IMGX; j++)
    for (i = 0; i < IMGY; i++)
        red_freq[img[i][j][0]] += 1;

Everything else stays the same. Why is the first algorithm faster than the second algorithm?
Does it have something to do with the memory allocation?
The first version alters memory in sequence, so uses the processor cache optimally.
The second version uses one value from each cache line it loads, so it is pessimal for cache use.
The point to understand is that the cache is divided into lines, each of which will contain many values in the overall structure.
The first version might also be optimized by the compiler to use more clever instructions (SIMD instructions) which would be even faster.
It is because the first version is iterating through the memory in the order that it is physically laid out, while the second one is jumping around in memory from one column in the array to the next. This will cause cache thrashing and interfere with the optimal performance of the CPU, which then has to spend lots of time waiting for the cache to be refreshed over and over again.
It's because big modern processor architectures (like the one in a PC) are massively optimised to work on memory which is 'near' (in address-related terms) memory which they've recently accessed. Actual physical memory access is much, much slower than the CPU can theoretically run, so everything which helps the process do its access in the most efficient fashion helps with performance.
It's pretty much impossible to generalise more than that, but 'locality of reference' is a good thing to aim for.
Due to how the memory is laid out the first version maintains data locality and therefore causes less cache misses.
Memory allocation happens only once, at the beginning, so it cannot be the reason. The reason is how the runtime calculates the address. In both cases the memory address is calculated as
(i * (IMGX * 3)) + (j * 3) + 0
In the first algorithm
(i * (IMGX * 3)) gets calculated 8192 times
(j * 3) gets calculated 8192 * 8192 times
In the second algorithm
(i * (IMGX * 3)) gets calculated 8192 * 8192 times
(j * 3) gets calculated 8192 times
Since
(i * (IMGX * 3))
involves two multiplications, doing it more often takes more time. That is the reason.
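A sketch of hoisting that per-row term out of the inner loop: the row base is computed once per value of i instead of once per element. It reuses the globals (IMGX, IMGY, img, red_freq) from the question, so it is a fragment of the idea rather than a drop-in replacement:

void count_red_hoisted(void)
{
    int i, j;
    for (i = 0; i < IMGY; i++) {
        char (*row)[3] = img[i];        /* row base computed once per row */
        for (j = 0; j < IMGX; j++)
            red_freq[row[j][0]] += 1;
    }
}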
Yes, it has something to do with how the memory is laid out. The first loop indexes the inner dimension of img, which spans only 3 bytes each time; that stays within one memory page easily (I believe a common page size is 4 kB). But with your second version, the outer dimension's index changes fast, which causes memory reads spread over a much larger range of memory - namely sizeof(char[IMGX][3]) bytes, which is 24 kB - and with each change of the inner index those jumps start happening again. That will hit different pages and is probably somewhat slower. Also, I have heard that the CPU reads ahead in memory, which benefits the first version, because at the time it reads, the data is probably already in the cache. I can imagine the second version doesn't benefit from that, because it makes those large jumps back and forth around memory.
I would suspect the difference is not that much, but if the algorithm runs many times, it eventually becomes noticeable. You probably want to read the article Row-major Order on wikipedia. That is the scheme used to store multi-dimensional arrays in C.
