Dereferencing iterators performance - c

I have 3 functions that come down to the code below, that runs 800x800 times:
Each while loop below runs for exactly 800 times before iter1 == lim, so the duration was measured as it ran 800x800x800 (512 millions) times.
iter1, iter2 and lim are double pointers. They point to a large enough vector of double's.
sum is a double local variable.
s1 and s2 are local unsigned int's, both equal to 800.
First runs in 2.257 seconds:
while ( iter1 < lim )
{
sum += *iter1 * *iter2;
++iter1;
iter2 += s2;
}
Second runs in 7.364 seconds:
while ( iter1 < lim )
{
sum += *iter1 * *iter2;
iter1 += s1;
iter2 += s2;
}
Third runs in 1.355 seconds:
while ( iter1 < lim )
{
sum += *iter1 * *iter2;
++iter1;
++iter2;
}
If I remove the sum += *iter1 * *iter2; instruction from each of them, they all run in around 1.07 seconds.
If I remove the second multiplication and change the instruction to sum += *iter1;, the first and third run in 1.33 seconds, while the second runs in 1.46 seconds.
If I remove the other iterator, like this: sum += *iter2;, then the first and second run in around 2.2 seconds, while the third runs in 1.35 seconds.
Obviously, the performance drop is tied to the quantity added to iter1 and iter2. I am no expert in how the processor accesses memory and dereferences pointers, so I hope someone in the community knows more than me and is willing to shed some light on my problem.
If you need any information about the hardware I ran these tests on, or anything else that can prove helpful, feel free to ask in the comments.
EDIT: The problem is that the second function was slow, when compared to the others, and wanted to know if there is anything I can do to make it run faster, as it appeared to be doing similar things like the other 2.
EDIT 2: All the measurements were made in Release build

This is just a manifestation of data locality.
It takes less time to look at something at the next page of a book than something at the 800th next page. Try for yourself at home.

The performance difference has nothing to do with the iterators. The difference is in the extra cache misses from advancing through a large amount of data with greater than unit stride.

My guess is the resulting machine code, depending on the sofistication of your compiler/platform.
To retrieve a pointer value, the internal machine will utilize something like a LOAD instruction, lets call the fictional assembler code LD addr0.
addr0 refers to the address register that is used.
A lot of CPUs provide statements like LD addr0+ that increment the address after loading the stored value. Often, this additional increment does not lead to any extra cycles.
I worked with some compilers that could only generate the addr0+ statements if the address increment is done by the increment operator directly after or in the dereferenciation statement.
So the last one could be the example with the most efficient machine code.
It would be interesting if you could post the resulting assembler code of the compilation process for each of the example.

Related

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment - compare 2 matrix multiplications - in the default way, and multiplication after transposition of second matrix, we should point the difference which method is faster. I've written something like this below, but time and time2 are nearly equal to each other. In one case the first method is faster, I run the multiplication with the same size of matrix, and in another one the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum = 0;
for(int k=0; k<size; ++k) {
sum = sum + (m1[i][k] * m2[k][j]);
}
score[i][j] = sum;
}
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
int temp = m2[i][j];
m2[i][j] = m2[j][i];
m2[j][i] = temp;
}
}
clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum2 = 0;
for(int k=0; k<size; ++k) {
sum2 = sum2 + (m1[k][i] * m2[k][j]);
}
score[i][j] = sum2;
}
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but resolution may be smaller (i.e., the clocks change by more than 1 nanosecond, whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>
static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;
void timing_start(void)
{
clock_gettime(CLOCK_REALTIME, &wall_start);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}
void timing_stop(void)
{
struct timespec cpu_end, wall_end;
clock_gettime(CLOCK_REALTIME, &wall_end);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
+ (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
+ (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. It prefers to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
+ a[r][1] * b[1][c]
: :
+ a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
+ a[r][1] * B[c][1]
: :
+ a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
For c in columns:
temporary = b[r][c]
b[r][c] = b[c][r]
b[c][r] = temporary
End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them back to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop accesses items far from each other in memory, it is anti-optimized: it uses the pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Seemed to have fixed it myself by type casting the cij2 pointer inside the mm256 call
so _mm256_storeu_pd((double *)cij2,vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of the Intel manual vectorization. But whenever I run the code I get a segmentation fault on trying to use my double *cij2.
if( q == 0)
{
__m256d vecA;
__m256d vecB;
__m256d vecC;
for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
{
double cij = C[i+j*lda];
double *cij2 = (double *)malloc(4*sizeof(double));
for (int k = 0; k < K; k+=4)
{
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
_mm256_storeu_pd(cij2, vecC);
for (int x = 0; x < 4; x++)
{
cij += cij2[x];
}
}
C[i+j*lda] = cij;
}
I've pinpointed the problem to the cij2 pointer. If i comment out the 2 lines that include that pointer the code runs fine, it doesn't work like it should but it'll actually run.
My question is why would i get a segmentation fault here? I know I've allocated the memory correctly and that the memory is a 256 vector of double's with size 64 bits.
After reading the comments I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. Shouldn't affect either way but will give me some more breathing room theoretically.
Second the problem isn't coming from a null return on the allocation, I added a couple loops in to increment through the array and make sure I could modify the memory without it crashing so I'm relatively sure that isn't the problem. The problem seems to stem from the loading of the data from vecC to the array.
Lastly I can not use BLAS calls. This is for a parallelisms class. I know it would be much simpler to call on something way smarter than I but unfortunately I'll get a 0 if I try that.
You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
Also, ideally nest your loops in a way that gave you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].

Why does copying a 2D array column by column take longer than row by row in C? [duplicate]

This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array?
(7 answers)
Closed 7 years ago.
#include <stdio.h>
#include <time.h>
#define N 32768
char a[N][N];
char b[N][N];
int main() {
int i, j;
printf("address of a[%d][%d] = %p\n", N, N, &a[N][N]);
printf("address of b[%5d][%5d] = %p\n", 0, 0, &b[0][0]);
clock_t start = clock();
for (j = 0; j < N; j++)
for (i = 0; i < N; i++)
a[i][j] = b[i][j];
clock_t end = clock();
float seconds = (float)(end - start) / CLOCKS_PER_SEC;
printf("time taken: %f secs\n", seconds);
start = clock();
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
a[i][j] = b[i][j];
end = clock();
seconds = (float)(end - start) / CLOCKS_PER_SEC;
printf("time taken: %f secs\n", seconds);
return 0;
}
Output:
address of a[32768][32768] = 0x80609080
address of b[ 0][ 0] = 0x601080
time taken: 18.063229 secs
time taken: 3.079248 secs
Why does column by column copying take almost 6 times as long as row by row copying? I understand that 2D array is basically an nxn size array where A[i][j] = A[i*n + j], but using simple algebra, I calculated that a Turing machine head (on main memory) would have to travel a distance of in both the cases. Here nxn is the size of the array and x is the distance between last element of first array and first element of second array.
It pretty much comes down to this image (source):
When accessing data, your CPU will not only load a single value, but will also load adjacent data into the CPU's L1 cache. When iterating through your array by row, the items that have automatically been loaded into the cache are actually the ones that are processed next. However, when you are iterating by column, each time an entire "cache line" of data (the size varies per CPU) is loaded, only a single item is used and then the next line has to be loaded, effectively making the cache pointless.
The wikipedia entry and, as a high level overview, this PDF should help you understand how CPU caches work.
Edit: chqrlie in the comments is of course correct. One of the relevant factors here is that only very few of your columns fit into the L1 cache at the same time. If your rows were much smaller (say, the total size of your two dimensional array was only some kilobytes) then you might not see a performance impact from iterating per-column.
While it's normal to draw the array as a rectangle, the addressing of array elements in memory is linear: 0 to one minus the number of bytes available (on nearly all machines).
Memory hierarchies (e.g. registers < L1 cache < L2 cache < RAM < swap space on disk) are optimized for the case where memory accesses are localized: accesses that are successive in time touch addresses that are close together. They are even more highly optimized (e.g. with pre-fetch strategies) for sequential access in linear order of addresses; e.g. 100,101,102...
In C, rectangular arrays are arranged in linear order by concatenating all the rows (other languages like FORTRAN and Common Lisp concatenate columns instead). Therefore the most efficient way to read or write the array is to do all the columns of the first row, then move on to the rest, row by row.
If you go down the columns instead, successive touches are N bytes apart, where N is the number of bytes in a row: 100, 10100, 20100, 30100... for the case N=10000 bytes.Then the second column is 101,10101, 20101, etc. This is the absolute worst case for most cache schemes.
In the very worst case, you can cause a page fault on each access. These days on even on an average machine it would take an enormous array to cause that. But if it happened, each touch could cost ~10ms for a head seek. Sequential access is a few nano-seconds per. That's over a factor of a million difference. Computation effectively stops in this case. It has a name: disk thrashing.
In a more normal case where only cache faults are involved, not page faults, you might see a factor of hundred. Still worth paying attention.
There are 3 main aspects that contribute to the timing different:
The first double loop accesses both arrays for the first time. You are actually reading uninitialized memory which is bad if you expect any meaningful results (functionally as well as timing-wise), but in terms of timing what plays part here is the fact that these addresses are cold, and reside in the main memory (if you're lucky), or aren't even paged (if you're less lucky). In the latter case, you would have a page fault on each new page, and would invoke a system call to allocate a page for the first time. Note that this doesn't have anything to do with the order of traversal, but simply because the first access is much slower. To avoid that, initialize both arrays to some value.
Cache line locality (as explained in the other answers) - if you access sequential data, you miss once per line, and then enjoy the benefit of having it fetched already. You most likely won't even hit it in the cache but rather in some buffer, since the consecutive requests will be waiting for that line to get fetched. When accessing column-wise, you would fetch the line, cache it, but if the reuse distance is large enough - you would lose it and have to fetch it again.
Prefetching - modern CPUs would have HW prefetching mechanisms that can detect sequential accesses and prefetch the data ahead of time, which will eliminate even the first miss of each line. Most CPUs also have stride based prefetches which may be able to cover the column size, but these things don't work well usually with matrix structures since you have too many columns and it would be impossible for HW to track all these stride flows simultaneously.
As a side note, I would recommend that any timing measurement would be performed multiple times and amortized - that would have eliminated problem #1.

C cache optimization for direct mapped cache

Having some trouble figuring out the hit and miss rates of the following two snippets of code.
Given info: we have a 1024 Byte direct-mapped cache with block sizes of 16 bytes. So that makes 64 lines (sets in this case) then. Assume the cache starts empty. Consider the following code:
struct pos {
int x;
int y;
};
struct pos grid[16][16];
int total_x = 0; int total_y = 0;
void function1() {
int i, j;
for (i = 0; i < 16; i++) {
for (j = 0; j < 16; j++) {
total_x += grid[j][i].x;
total_y += grid[j][i].y;
}
}
}
void function2() {
int i, j;
for (i = 0; i < 16; i++) {
for (j = 0; j < 16; j++) {
total_x += grid[i][j].x;
total_y += grid[i][j].y;
}
}
}
I can tell from some basic rules (i.e. C arrays are row-major order) that function2 should be better. But I don't understand how to calculate the hit/miss percentages. Apparently function1() misses 50% of the time, while function2() only misses 25% of the time.
Could somebody walk me through how those calculations work? All I can really see is that no more than half the grid will ever fit inside the cache at once. Also, is this concept easy to extend to k-way associative caches?
Thanks.
How data are stored in memory
Every structure pos has a size of 8 Bytes, thus the total size of pos[16][16] is 2048 Bytes. And the order of the array are as follows:
pos[0][0] pos[0][1] pos[0][2] ...... pos[0][15] pos[1]0[] ...... pos[1][15].......pos[15][0] ......pos[15][15]
The cache organization compared to the data
For the cache, each block is 16 Bytes, which is the same size as two elements of the array. The Entire cache is 1024 Bytes, which is half the size of the entire array. Since cache is direct-mapped, that means if we label the cache block from 0 to 63, we can safely assume that the mapping should look like this
------------ memory----------------------------cache
pos[0][0] pos[0][1] -----------> block 0
pos[0][2] pos[0][3] -----------> block 1
pos[0][4] pos[0][5] -----------> block 2
pos[0][14] pos[0][15] --------> block 7
.......
pos[1][0] pos[1][1] -----------> block 8
pos[1][2] pos[1][3] -----------> block 9
.......
pos[7][14] pos[7][15] --------> block 63
pos[8][0] pos[8][1] -----------> block 0
.......
pos[15][14] pos[15][15] -----> block 63
How function1 manipulates memory
The loop follows a column-wise inner loop, that means the first iteration loads pos[0][0] and pos[0][1] to cache block 0, the second iteration loads pos[1][0] and pos[1][1] to cache block 8. Caches are cold, so the first column x is always miss, while y is always hit. The second column data are supposedly all loaded in cache during the first column access, but this is NOT the case. Since pos[8][0] access has already evict the former pos[0][0] page(they both map to block 0!).So on, the miss rate is 50%.
How function2 manipulates memory
The second function has nice stride-1 access pattern. That means when accessing pos[0][0].x pos[0][0].y pos[0][1].x pos[0][1].y only the first one is a miss due to the cold cache. The following patterns are all the same. So the miss rate is only 25%.
K-way associative cache follows the same analysis, although that may be more tedious. For getting the most out of the cache system, try to initiate a nice access pattern, say stride-1, and use the data as much as possible during each loading from memory. Real world cpu microarchitecture employs other intelligent design and algorithm to enhance the efficiency. The best method is always to measure the time in real world, dump the core code, and do a thorough analysis.
Ok, my computer science lectures are a bit far off but I think I figured it out (it's actually a very easy example when you think about it).
Your struct is 8 byte long (2 x 4). Since your cache blocks are 16 bytes, a memory access grid[i][j] will fetch exactly two struct entries (grid[i][j] and grid[i][j+1]). Therefore, if you loop through the second index only every 4th access will lead to a memory read. If you loop through the first index, you probably throw away the second entry that has been fetched, that depends on the number of fetches in the inner loop vs. the overall cache-size though.
Now we have to think about the cache size as well: You say that you have 64 lines that are directly mapped. In function 1, an inner loop is 16 fetches. That means, the 17th fetch you get to grid[j][i+1]. This should actually be a hit, since it should have been kept in the cache since the last inner loop walk. Every second inner loop should therefore only consist of hits.
Well, if my reasonings are correct, the answer that has been given to you should be wrong. Both functions should perform with 25% misses. Maybe someone finds a better answer but if you understand my reasoning I'd ask a TA about that.
Edit: Thinking about it again, we should first define what actually qualifies as a miss/hit. When you look at
total_x += grid[j][i].x;
total_y += grid[j][i].y;
are these defined as two memory accesses or one? A decent compiler with optimization settings should optimize this to
pos temp = grid[j][i];
total_x += temp.x;
total_y += temp.y;
which could be counted as one memory access. I therefore propose the universal answer to all CS questions: "It depends."

algorithm comparison in C, what's the difference?

#define IMGX 8192
#define IMGY 8192
int red_freq[256];
char img[IMGY][IMGX][3];
main(){
int i, j;
long long total;
long long redness;
for (i = 0; i < 256; i++)
red_freq[i] = 0;
for (i = 0; i < IMGY; i++)
for (j = 0; j < IMGX; j++)
red_freq[img[i][j][0]] += 1;
total = 0;
for (i = 0; i < 256; i++)
total += (long long)i * (long long)red_freq[i];
redness = (total + (IMGX*IMGY/2))/(IMGX*IMGY);
what's the difference when you replace the second for loop into
for (j = 0; j < IMGX; j++)
for (i = 0; i < IMGY; i++)
red_freq[img[i][j][0]] += 1;
everything else are stay the same and why the first algorithm is faster than then second algorithm ?
Does it have something to do with the memory allocation?
The first version alters memory in sequence, so uses the processor cache optimally.
The second version uses one value from each cache line it loads, so it pessimal for cache use.
The point to understand is that the cache is divided into lines, each of which will contain many values in the overall structure.
The first version might also be optimized by the compiler to use more clever instructions (SIMD instructions) which would be even faster.
It is because the first version is iterating through the memory in the order that it is physically laid out, while the second one is jumping around in memory from one column in the array to the next. This will cause cache thrashing and interfere with the optimal performance of the CPU, which then has to spend lots of time waiting for the cache to be refreshed over and over again.
It's because big modern processor architectures (like the one in a PC) are massively optimised to work on memory which is 'near' (in address-related terms) memory which they've recently accessed. Actual physical memory access is much, much slower than the CPU can theoretically run, so everything which helps the process do its access in the most efficient fashion helps with performance.
It's pretty much impossibly to generalise more than that, but 'locality of reference' is a good thing to aim for.
Due to how the memory is laid out the first version maintains data locality and therefore causes less cache misses.
memory allocation happens only once and it is at the beginning so it can not be the reason. the reason is how the runtime calculates the address. In both cases memory address is calculated as
(i * (IMGY * IMGX)) + (j * IMGX) + 0
In the first algorithm
(i * (IMGY * IMGX)) gets calculates 8192 times
(j * IMGX) gets calculated 8192 * 8192 times
In the second algorithm
(i * (IMGY * IMGX)) gets calculates 8192 * 8192 times
(j * IMGX) gets calculated 8192 times
Since
(i * (IMGY * IMGX))
involves two multiplications, doing it more takes more time. that is the reason
Yes it has something to do with memory allocation. The first loop indexes the inner dimension of img, which happens to span over only 3 bytes each time. That's within one memory page easily (i believe a common size here is 4kB for one page). But with your second version, the outer dimension's index changes fast. That will cause memory reads spread over a much larger range of memory - namely sizeof (char[IMGX][3]) bytes, which is 24kB. And with each change of the inner index, those jumps start to happen again. That will hit different pages and is probably somewhat slower. Also i heard the CPU reads ahead memory. That will make the first version benefit, because at the time it reads, that data is probably already in the cache. I can imagine the second version doesn't benefit from that, because it makes those large jumps around the memory back and forth.
I would suspect the difference is not that much, but if the algorithm runs many times, it eventually becomes noticeable. You probably want to read the article Row-major Order on wikipedia. That is the scheme used to store multi-dimensional arrays in C.

Resources