Manually optimize a nested loop - c

I'm working on a homework assignment where I must manually optimize a nested loop (my program will be compiled with optimizations disabled). The goal of the assignment is to run the entire program in less than 6 seconds (extra credit for less than 4.5 seconds).
I'm only allowed to change a small block of code, and the starting point is such:
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
Where ARRAY_SIZE is 9973. This loop is contained within another loop that is run 200,000 times. This particular version runs in 16 seconds.
What I've done so far is change the implementation to unroll the loop and use pointers as my iterator:
(These declarations are not looped over 200,000 times)
register int unroll_length = 16;
register int *unroll_end = array + (ARRAY_SIZE - (ARRAY_SIZE % unroll_length));
register int *end = array + (ARRAY_SIZE -1);
register int *curr_end;
curr_end = end;
while (unroll_end != curr_end) {
sum += *curr_end;
curr_end--;
}
do {
sum += *curr_end + *(curr_end-1) + *(curr_end-2) + *(curr_end-3) +
*(curr_end-4) + *(curr_end-5) + *(curr_end-6) + *(curr_end-7) +
*(curr_end-8) + *(curr_end-9) + *(curr_end-10) + *(curr_end-11) +
*(curr_end-12) + *(curr_end-13) + *(curr_end-14) + *(curr_end-15);
}
while ((curr_end -= unroll_length) != array);
sum += *curr_end;
Using these techniques, I was able to get the execution down to 5.5 seconds, which will give me full credit. However; I sure do want to earn the extra credit, but I'm also curious what additional optimizations I can make that I might be overlooking?
Edit #1 (Adding outer loop)
srand(time(NULL));
for(j = 0; j < ARRAY_SIZE; j++) {
x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
array[j] = x;
checksum += x;
}
for (i = 0; i < N_TIMES; i++) {
// inner loop goes here
if (sum != checksum)
printf("Checksum error!\n");
sum = 0;
}

you could try to store your variables in CPU register with :
register int *unroll_limit = array + (ARRAY_SIZE - (ARRAY_SIZE % 10));
register int *end = array + ARRAY_SIZE;
register int *curr;
and try with different size of manual loops to check when you maximize cache usage.

I'm going to assume you're on x86, if you're not most of this will still apply but the details differ.
Use SIMD/SSE, this will get you a 4x speed increase without much effort, it needs 16-byte aligned data that you can get with _aligned_malloc or regular malloc + manual alignment. Besides that all you'll need in this case is _mm_add_epi32 to do four additions at the same time. (Different architectures have different SIMD units so check yours).
Use multi-threading/ multiple cores in this case it'd be easiest to have each thread sum half the array to a temporary variable and sum those two results when done. This will scale linearly across the number of cores available.
Prefetch to L1 cache; this only works when you've got a huge array and are sure to be able to stress the CPU for at least ~200 cycles (eg. a roundtrip to main RAM).
Completely go out of your way to optimize the hell out of it and use a GPU based approach. This will require you to set up a CUDA or OpenCL environment and upload the array to the GPU. This is about ~400 LoC excluding the compute kernel. But might not be worth it if you have a small dataset (eg. too much overhead in setting up/tearing down) or if you have a huge changing dataset (eg. too much time spend in streaming to the GPU).
Align to page boundaries to prevent page-faults (expensive) on windows these are usually 4K in size.
Manually unroll the loop while taking into account dual issuing instructions and instruction latencies. This information is available from your CPU manufacturer (Intel provides these too). But on x86 this isn't really useful because of it's CPUs out of order execution.
Depending on your platform actually getting the data to the CPU for processing is the slowest part (this is mainly true for recent consoles & PS, I've never developed for small embedded devices) so you'll want to optimize for that. Tricks like iterating backwards are nice on a 6502 when cycles were the bottleneck but these days you'll want to access RAM linearly.
If you do happen to be on a machine with fast RAM (eg. NOT PC/Consoles), converting from the plain array to a more fancy data-structure (eg. one that does more pointer chasing) might totally be worth it.
All in all, I guess that 1 & 2 are easiest and most feasible and will gain you more than enough performance (eg. 8x on a Core 2 Duo). However, it all comes down to knowing your hardware and programming PIC will require completely different optimizations (eg. instruction level manual pipelining) than a general PC will.

Try to align the array on a page boundary ( i.e. 4K )
Try to compute with a wider data type, i.e. 64 bit- instead of 32-bit integers. This way you can add 2 numbers at once. As the final step add up the both halves.
Convert part of the array or the computation to floating point, so you can use FPU and CPU in parallel
I don't expect the following suggestions to be allowed but I mention them anyway
Multithreading
Specialized CPU-Instructions, i.e. SSE

If the array values don't change, you could memoize the sum (i.e. calculate it on first run, and use the calculated sum on subsequent runs).

Some nice optimization tricks:
make your loop count backwards from ARRAY_SIZE to 0 so that way you can remove the comparisons from your code. Less comparisons speed up the program.
Furthermore x86 nowadays are optimized for short loops which they can "preload" to run faster then normal.
Try to use registers wherever possible
Use pointers instead of array indices
So if you would use arrays, try to use:
register int idx = ARRAY_SIZE - 1;
register int sum = 0;
do {
sum += array[idx];
} while (idx-- % 10 != 0);
do {
sum += array[idx] + array[idx - 1] + array[idx - 2] + array[idx - 3] + array[idx - 4] + array[idx - 5] + array[idx - 6] + array[idx - 7] + array[idx - 8] + array[idx - 9];
} while (idx -= 10);
// now we don't use a comparison and the ZERO flag will be set in FLAG
// register on which we can conditional jump. With a comparison you do VALUE - VALUE
// and then check if the ZERO flag is set or the NEGATIVE flag or whatever you are testing on

Related

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment - compare 2 matrix multiplications - in the default way, and multiplication after transposition of second matrix, we should point the difference which method is faster. I've written something like this below, but time and time2 are nearly equal to each other. In one case the first method is faster, I run the multiplication with the same size of matrix, and in another one the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum = 0;
for(int k=0; k<size; ++k) {
sum = sum + (m1[i][k] * m2[k][j]);
}
score[i][j] = sum;
}
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
int temp = m2[i][j];
m2[i][j] = m2[j][i];
m2[j][i] = temp;
}
}
clock_t start2 = clock();
int sum2;
for(int i=0; i<size; ++i) {
for(int j=0; j<size; ++j) {
sum2 = 0;
for(int k=0; k<size; ++k) {
sum2 = sum2 + (m1[k][i] * m2[k][j]);
}
score[i][j] = sum2;
}
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but resolution may be smaller (i.e., the clocks change by more than 1 nanosecond, whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>
static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;
void timing_start(void)
{
clock_gettime(CLOCK_REALTIME, &wall_start);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}
void timing_stop(void)
{
struct timespec cpu_end, wall_end;
clock_gettime(CLOCK_REALTIME, &wall_end);
clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
+ (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
+ (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. It prefers to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
+ a[r][1] * b[1][c]
: :
+ a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
+ a[r][1] * B[c][1]
: :
+ a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
For c in columns:
temporary = b[r][c]
b[r][c] = b[c][r]
b[c][r] = temporary
End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them back to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop accesses items far from each other in memory, it is anti-optimized: it uses the pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

Switch with tons of cases vs single if

Let's say we need to go through a 128-element array of uint8 and compare neighbour elements and put the result to another array. The code below is the most simple and readable way to solve this problem.
for (i = 1; i < 128; i++)
if (arr[i] < arr[i-1] + 64) //don't care about overflow
arr2[i] = 1;
It looks like this code 1) will not use branch table.
And as far as I know, a cpu doesn't read just 1 byte, it actually reads 8 bytes (assuming a 64bit machine), and that 2) makes cpu do some extra work.
So here comes another approach. Read 2 (or 4 or 8) bytes at a time and create an extremely huge switch (2^16, 2^32 or 2^64 cases respectively), which has every possible combination of bytes in our array. Does this make any sense?
For this discussion let's assume the following:
1) Our main priority is speed
2) Next is RAM consumption.
We don't care about the size of the executable (unless they somehow affect speed or RAM)
You should know that switches are actually very slow as branch would be likely mispredicted. What makes switch fast is jump table:
switch (i) {
case 0: ...
case 1: ...
}
gets translated into this:
labels = {&case0, &case1}
goto labels[i]
However, you do not need this either as your only writing memory cell and you can write a "jump table", or more specifically pre-computed matrix of answers yourself:
for (i = 1; i < 128; i++)
arr2[i] = answers[arr[i]][arr[i-1]];
uint8 have only 256 possible values which gives us 64k of RAM required for that matrix.

Segmentation fault when trying to use intrinsics specifically _mm256_storeu_pd()

Seemed to have fixed it myself by type casting the cij2 pointer inside the mm256 call
so _mm256_storeu_pd((double *)cij2,vecC);
I have no idea why this changed anything...
I'm writing some code and trying to take advantage of the Intel manual vectorization. But whenever I run the code I get a segmentation fault on trying to use my double *cij2.
if( q == 0)
{
__m256d vecA;
__m256d vecB;
__m256d vecC;
for (int i = 0; i < M; ++i)
for (int j = 0; j < N; ++j)
{
double cij = C[i+j*lda];
double *cij2 = (double *)malloc(4*sizeof(double));
for (int k = 0; k < K; k+=4)
{
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
_mm256_storeu_pd(cij2, vecC);
for (int x = 0; x < 4; x++)
{
cij += cij2[x];
}
}
C[i+j*lda] = cij;
}
I've pinpointed the problem to the cij2 pointer. If i comment out the 2 lines that include that pointer the code runs fine, it doesn't work like it should but it'll actually run.
My question is why would i get a segmentation fault here? I know I've allocated the memory correctly and that the memory is a 256 vector of double's with size 64 bits.
After reading the comments I've come to add some clarification.
First thing I did was change the _mm_malloc to just a normal allocation using malloc. Shouldn't affect either way but will give me some more breathing room theoretically.
Second the problem isn't coming from a null return on the allocation, I added a couple loops in to increment through the array and make sure I could modify the memory without it crashing so I'm relatively sure that isn't the problem. The problem seems to stem from the loading of the data from vecC to the array.
Lastly I can not use BLAS calls. This is for a parallelisms class. I know it would be much simpler to call on something way smarter than I but unfortunately I'll get a 0 if I try that.
You dynamically allocate double *cij2 = (double *)malloc(4*sizeof(double)); but you never free it. This is just silly. Use double cij2[4], especially if you're not going to bother to align it. You never need more than one scratch buffer at once, and it's a small fixed size, so just use automatic storage.
In C++11, you'd use alignas(32) double cij2[4] so you could use _mm256_store_pd instead of storeu. (Or just to make sure storeu isn't slowed down by an unaligned address).
If you actually want to debug your original, use a debugger to catch it when it segfaults, and look at the pointer value. Make sure it's something sensible.
Your methods for testing that the memory was valid (like looping over it, or commenting stuff out) sound like they could lead to a lot of your loop being optimized away, so the problem wouldn't happen.
When your program crashes, you can also look at the asm instructions. Vector intrinsics map fairly directly to x86 asm (except when the compiler sees a more efficient way).
Your implementation would suck a lot less if you pulled the horizontal sum out of the loop over k. Instead of storing each multiply result and horizontally adding it, use a vector add into a vector accumulator. hsum it outside the loop over k.
__m256d cij_vec = _mm256_setzero_pd();
for (int k = 0; k < K; k+=4) {
vecA = _mm256_load_pd(&A[i+k*lda]);
vecB = _mm256_load_pd(&B[k+j*lda]);
vecC = _mm256_mul_pd(vecA,vecB);
cij_vec = _mm256_add_pd(cij_vec, vecC); // TODO: use multiple accumulators to keep multiple VADDPD or VFMAPD instructions in flight.
}
C[i+j*lda] = hsum256_pd(cij_vec); // put the horizontal sum in an inline function
For good hsum256_pd implementations (other than storing to memory and using a scalar loop), see Fastest way to do horizontal float vector sum on x86 (I included an AVX version there. It should be easy to adapt the pattern of shuffling to 256b double-precision.) This will help your code a lot, since you still have O(N^2) horizontal sums (but not O(N^3) with this change).
Ideally you could accumulate results for 4 i values in parallel, and not need horizontal sums.
VADDPD has a latency of 3 to 4 clocks, and a throughput of one per 1 to 0.5 clocks, so you need from 3 to 8 vector accumulators to saturate the execution units. Or with FMA, up to 10 vector accumulators (e.g. on Haswell where FMA...PD has 5c latency and one per 0.5c throughput). See Agner Fog's instruction tables and optimization guides to learn more about that. Also the x86 tag wiki.
Also, ideally nest your loops in a way that gave you contiguous access to two of your three arrays, since cache access patterns are critical for matmul (lots of data reuse). Even if you don't get fancy and transpose small blocks at a time that fit in cache. Even transposing one of your input matrices can be a win, since that costs O(N^2) and speeds up the O(N^3) process. I see your inner loop currently has a stride of lda while accessing A[].

How to use AVX/SIMD with nested loops and += format?

I am writing a page rank program. I am writing a method for updating the rankings. I have successful got it working with nested for loops and also a threaded version. However I would like to instead use SIMD/AVX.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
temp[i] = 0.0;
for (size_t j = 0; j < npages; j++) {
temp[i] += P[j] * matrix_cap[IDX(i,j)];
}
}
For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX and how I would get my data which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row major order) into a format of which could be used with SIMD/AVX operations.
As far as AVX this is what I have so far though it's very very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
for (size_t j = 0; j < mod; j += 4) {
__m256d p = _mm256_loadu_pd(P + j);
__m256d m = _mm256_loadu_pd(matrix_hat + i + j);
__m256d pm = _mm256_mul_pd(p, m);
_mm256_storeu_pd(&res + j, pm);
for (size_t k = 0; k < 4; k++) {
sum += res[j + k];
}
}
for (size_t i = mod; i < npages; i++) {
for (size_t j = 0; j < npages; j++) {
sum += P[j] * matrix_cap[IDX(i,j)];
}
}
temp[i] = sum;
sum = 0.0;
}
How to can I format my data so I can use AVX/SIMD operations (add,mul) on it to optimise it as it will be called a lot.
Consider using OpenMP4.x #pragma omp simd reduction for innermost loop. Take in mind that omp reductions are not applicable to C++ arrays, therefore you have to use temporary reduction variable like shown below.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
my_type tmp_reduction = 0.0; // was: // temp[i] = 0.0;
#pragma omp simd reduction (+:tmp_reduction)
for (size_t j = 0; j < npages; j++) {
tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
}
temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP4.x is currently supported by fresh GCC (4.9+) and Intel Compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" means vectorization by compiler without any pragmas, i.e. without explicit gudiance from developers) may sometimes work for some compiler variants (although it's very unlikely due to array element as reduction variable). However it is strictly speaking incorrect to auto-vectorize this code. You have to use explicit SIMD pragma to "resolve" reduction dependency and (as a good side-effect) disambiguate pointers (in case arrays are accessed via pointer).
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i+=8) {
__m256d s0 = _mm256_setzero_pd(),
s1 = _mm256_setzero_pd(),
s2 = _mm256_setzero_pd(),
...
s7 = _mm256_setzero_pd(); // 8 accumulators for 8 i values
for (size_t j = 0; j < (npages & ~(3ULL)); j+=4) {
__m256d Pj = _mm256_loadu_pd(P+j); // reused 8 times after loading
//temp[i] += P[j] * matrix_cap[IDX(i,j)];
s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
// ...
s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
}
// or do this block with a hsum+transpose and do vector stores.
// taking advantage of the power of vhaddpd to be doing 4 useful hsums with each instructions.
temp[i+0] = hsum_pd256(s0); // See the horizontal-sum link earlier for how to write this function
temp[i+1] = hsum_pd256(s1);
//...
temp[i+7] = hsum_pd256(s7);
// if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
How to can I format my data so I can use AVX/SIMD operations (add,mul) on it to optimise it as it will be called a lot.
I missed this part of the question earlier. First of all, obviously float will and give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give more speedup than that if cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vec of 8 is for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.

algorithm comparison in C, what's the difference?

#define IMGX 8192
#define IMGY 8192
int red_freq[256];
char img[IMGY][IMGX][3];
main(){
int i, j;
long long total;
long long redness;
for (i = 0; i < 256; i++)
red_freq[i] = 0;
for (i = 0; i < IMGY; i++)
for (j = 0; j < IMGX; j++)
red_freq[img[i][j][0]] += 1;
total = 0;
for (i = 0; i < 256; i++)
total += (long long)i * (long long)red_freq[i];
redness = (total + (IMGX*IMGY/2))/(IMGX*IMGY);
what's the difference when you replace the second for loop into
for (j = 0; j < IMGX; j++)
for (i = 0; i < IMGY; i++)
red_freq[img[i][j][0]] += 1;
everything else are stay the same and why the first algorithm is faster than then second algorithm ?
Does it have something to do with the memory allocation?
The first version alters memory in sequence, so uses the processor cache optimally.
The second version uses one value from each cache line it loads, so it pessimal for cache use.
The point to understand is that the cache is divided into lines, each of which will contain many values in the overall structure.
The first version might also be optimized by the compiler to use more clever instructions (SIMD instructions) which would be even faster.
It is because the first version is iterating through the memory in the order that it is physically laid out, while the second one is jumping around in memory from one column in the array to the next. This will cause cache thrashing and interfere with the optimal performance of the CPU, which then has to spend lots of time waiting for the cache to be refreshed over and over again.
It's because big modern processor architectures (like the one in a PC) are massively optimised to work on memory which is 'near' (in address-related terms) memory which they've recently accessed. Actual physical memory access is much, much slower than the CPU can theoretically run, so everything which helps the process do its access in the most efficient fashion helps with performance.
It's pretty much impossibly to generalise more than that, but 'locality of reference' is a good thing to aim for.
Due to how the memory is laid out the first version maintains data locality and therefore causes less cache misses.
memory allocation happens only once and it is at the beginning so it can not be the reason. the reason is how the runtime calculates the address. In both cases memory address is calculated as
(i * (IMGY * IMGX)) + (j * IMGX) + 0
In the first algorithm
(i * (IMGY * IMGX)) gets calculates 8192 times
(j * IMGX) gets calculated 8192 * 8192 times
In the second algorithm
(i * (IMGY * IMGX)) gets calculates 8192 * 8192 times
(j * IMGX) gets calculated 8192 times
Since
(i * (IMGY * IMGX))
involves two multiplications, doing it more takes more time. that is the reason
Yes it has something to do with memory allocation. The first loop indexes the inner dimension of img, which happens to span over only 3 bytes each time. That's within one memory page easily (i believe a common size here is 4kB for one page). But with your second version, the outer dimension's index changes fast. That will cause memory reads spread over a much larger range of memory - namely sizeof (char[IMGX][3]) bytes, which is 24kB. And with each change of the inner index, those jumps start to happen again. That will hit different pages and is probably somewhat slower. Also i heard the CPU reads ahead memory. That will make the first version benefit, because at the time it reads, that data is probably already in the cache. I can imagine the second version doesn't benefit from that, because it makes those large jumps around the memory back and forth.
I would suspect the difference is not that much, but if the algorithm runs many times, it eventually becomes noticeable. You probably want to read the article Row-major Order on wikipedia. That is the scheme used to store multi-dimensional arrays in C.

Resources