Tiled Matrix Multiplication using AVX - c

I have coded the following C function for multiplying two NxN matrices using tiling/blocking and AVX vectors to speed up the calculation. Right now though I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?
Also, is there a better memory access pattern for matrix B? Maybe transposing it first, or even swapping the k and j loops? Right now I'm traversing it column-wise, which is probably not very efficient with regard to spatial locality and cache lines.
void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
{
    int i, j, k, i0, j0, k0;
    // double sum;
    __m256d sum;
    for (i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
        for (k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
            for (j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
                for (i = i0; i < MIN(i0+BLOCKSIZE, SIZE_M); i++) {
                    for (j = j0; j < MIN(j0+BLOCKSIZE, SIZE_K); j++) {
                        // sum = C[i][j];
                        sum = _mm256_load_pd(&C[i][j]);
                        for (k = k0; k < MIN(k0+BLOCKSIZE, SIZE_N); k++) {
                            // sum += A[i][k] * B[k][j];
                            sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
                        }
                        // C[i][j] = sum;
                        _mm256_store_pd(&C[i][j], sum);
                    }
                }
            }
        }
    }
}

_mm256_load_pd is an alignment-required load but you're only stepping by k++, not k+=4 in the inner-most loop that loads a 32-byte vector of 4 doubles. So it faults because 3 of every 4 loads are misaligned.
You don't want to be doing overlapping loads; your real bug is the indexing. If your input pointers are 32-byte aligned, you should be able to keep using _mm256_load_pd instead of switching to _mm256_loadu_pd. In this case _mm256_load_pd successfully caught your bug, instead of silently working but giving numerically wrong results.
Your strategy for vectorizing four row*column dot products (to produce a C[i][j+0..3] vector) should load 4 contiguous doubles from 4 different columns (B[k][j+0..3] via a vector load from B[k][j]), and broadcast 1 double from A[i][k]. Remember you want 4 dot products in parallel.
Another strategy might involve a horizontal sum at the end down to a scalar C[i][j] += horizontal_add(__m256d), but I think that would require transposing one input first so both row and column vectors are in contiguous memory for one dot product. But then you need shuffles for a horizontal sum at the end of each inner loop.
You probably also want to use at least 2 sum variables so you can read a whole cache line at once, and hide FMA latency in the inner loop and hopefully bottleneck on throughput. Or better do 4 or 8 vectors in parallel. So you produce C[i][j+0..15] as sum0, sum1, sum2, sum3. (Or use an array of __m256d; compilers will typically fully unroll a loop of 8 and optimize the array into registers.)
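To make that concrete, here is a minimal (non-blocked) sketch of the broadcast-A / vector-B strategy with two accumulators per row of C. It assumes SIZE_K is a multiple of 8 and that all three arrays are 32-byte aligned (otherwise use _mm256_loadu_pd / _mm256_storeu_pd); with FMA available you would fold each mul+add pair into a single _mm256_fmadd_pd:

#include <immintrin.h>

void mmult_sketch(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
{
    for (int i = 0; i < SIZE_M; i++) {
        for (int j = 0; j < SIZE_K; j += 8) {                // two vectors of C[i][j..j+7]
            // C must hold zeros (or earlier partial sums), as in your code
            __m256d sum0 = _mm256_load_pd(&C[i][j]);
            __m256d sum1 = _mm256_load_pd(&C[i][j+4]);
            for (int k = 0; k < SIZE_N; k++) {
                __m256d a = _mm256_broadcast_sd(&A[i][k]);   // one A element, splatted to all 4 lanes
                sum0 = _mm256_add_pd(sum0, _mm256_mul_pd(a, _mm256_load_pd(&B[k][j])));
                sum1 = _mm256_add_pd(sum1, _mm256_mul_pd(a, _mm256_load_pd(&B[k][j+4])));
            }
            _mm256_store_pd(&C[i][j],   sum0);
            _mm256_store_pd(&C[i][j+4], sum1);
        }
    }
}

Re-introducing the cache-blocking loops around this works the same way as in your code, as long as j still advances in vector-sized steps and k in scalar steps.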
I think you only need 5 nested loops, to block over rows and columns. Although apparently 6 nested loops are a valid option: see loop tiling/blocking for large dense matrix multiplication which has a 5-nested loop in the question but a 6-nested loop in an answer. (Just scalar, though, not vectorized).
There might be other bugs besides the row*column dot product strategy here, I'm not sure.
If you're using AVX, you might want to use FMA as well, unless you need to run on Sandy Bridge/Ivy Bridge or AMD Bulldozer. (Piledriver and later have FMA3.)
Other matmul strategies include adding into the destination inside the inner loop so you're loading C and A inside the inner loop, with a load from B hoisted. (Or B and A swapped, I forget.) What Every Programmer Should Know About Memory? has a vectorized cache-blocked example that works this way in an appendix, for SSE2 __m128d vectors. https://www.akkadia.org/drepper/cpumemory.pdf

Related

If C is row-major order, why does ARM intrinsic code assume column-major order?

I'm not sure where the best place to ask this is, but I am currently working with ARM intrinsics and am following this guide: https://developer.arm.com/documentation/102467/0100/Matrix-multiplication-example
However, the code there was written assuming that the arrays are stored column-major order. I have always thought C arrays were stored row-major. Why did they assume this?
EDIT:
For example, if instead of this:
void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}
They had done this:
void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int k_idx=0; k_idx < k; k_idx++) {
            for (int j_idx=0; j_idx < m; j_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}
The code would run faster due to spatial locality of accessing C in the order C[0], C[1], C[2], C[3] instead of in the order C[0], C[2], C[1], C[3] (where C[0], C[1], C[2], C[3] are contiguous in memory).
You're not using C 2D arrays like C[i][j], so it's not a matter of how C stores anything, it's how 2D indexing is done manually in this code, using n * idx_1 + idx_2, with a choice of which you loop over in the inner vs. outer loops.
But the hard part of a matmul with both matrices non-transposed is that you need to make opposite choices for the two input matrices: a naive matmul has to stride through distant elements of one of the input matrices, so it's inherently screwed. That's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. (O(n^3) work over O(n^2) data - you want to get the most use out of it for every time you bring it into L1d cache, and/or into registers.)
Loop interchange can speed things up to take advantage of spatial locality in the inner-most loop, if you do it right.
See the cache-blocked matmul example in What Every Programmer Should Know About Memory? which traverses contiguous memory in all 3 inputs in the inner few loops, picking the index that isn't scaled in any of the 3 matrices as the inner one. That looks like this:
for (j_idx)
    for (k_idx)
        for (i_idx)
            C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
Notice that B[k * j_idx + k_idx] is invariant over the inner loop, and that you're doing a simple dst[0..n] += const * src[0..n] operation over contiguous memory (which is easy to SIMD vectorize), although you're still doing 2 loads + 1 store for every FMA, so that's not going to max out your FP throughput.
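Written out as compilable C (a sketch using plain float in place of float32_t, and assuming the caller has zeroed C[] first, which the original code also assumes), a hoisted-B version of that loop nest looks like:

#include <stdint.h>

void matrix_multiply_jki(const float *A, const float *B, float *C,
                         uint32_t n, uint32_t m, uint32_t k)
{
    for (uint32_t j_idx = 0; j_idx < m; j_idx++) {
        for (uint32_t k_idx = 0; k_idx < k; k_idx++) {
            float b = B[k * j_idx + k_idx];             // invariant over the inner loop
            for (uint32_t i_idx = 0; i_idx < n; i_idx++) {
                // contiguous dst[] += scale * src[] pass: easy for compilers to vectorize
                C[n * j_idx + i_idx] += A[n * k_idx + i_idx] * b;
            }
        }
    }
}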
Separate from the cache access pattern, that also avoids a long dependency chain into a single accumulator (element of C). But that's not a real problem for an optimized implementation: you can of course use multiple accumulators. FP math isn't strictly associative because of rounding error, but multiple accumulators are closer to pairwise summation and likely to have less bad FP rounding error than serially adding each element of the row x column dot product.
It will have different results to adding in the order standard simple C loop does, but usually closer to the exact answer.
Your proposed loop order i,k,j is the worst possible.
You're striding through distant elements of 2 of the 3 matrices in the inner loop, including discontiguous access to C[], opposite of what you said in your last paragraph.
With j as the inner-most loop, you'd access C[0], C[n], C[2n], etc. on the first outer iteration. And same for B[], so that's really bad.
Interchanging the i and j loops would give you contiguous access to C[] in the middle loop instead of strided, and still rows of one, columns of the other, in the inner-most loop. So that would be strictly an improvement: yes you're right that this naive example is constructed even worse than it needs to be.
But the key issue is the strided access to something in the inner loop: that's a performance disaster; that's a major part of why careful cache-blocking / loop-tiling is important for matrix multiplication. The only index that is never used with a scale factor is i.
C is not inherently row-major or column-major.
When writing a[i][j], it's up to you to decide whether i is a row index or a column index.
While it's somewhat of a common convention to write the row index first (making the arrays row-major), nothing stops you from doing the opposite.
Also, remember that A × B = C is equivalent to Bt × At = Ct (t meaning a transposed matrix), and reading a row-major matrix as if it was column-major (or vice versa) transposes it, meaning that if you want to keep your matrices row-major, you can just reverse the order of the operands.

How to do a proper Cache Blocked Matrix Transposition?

I am trying to do a cache-blocked matrix transposition in C, but I am having some trouble with my code. My guess is that it has to do with the indices. Can you tell me where I am going wrong?
I am considering both of these algorithms I found on the web: http://users.cecs.anu.edu.au/~Alistair.Rendell/papers/coa.pdf and http://iosrjen.org/Papers/vol3_issue11%20(part-4)/I031145055.pdf
But I haven't yet figured out how to code them correctly.
for (i = 0; i < N; i += block) {
    for (j = 0; j < i; j += block) {
        for (ii = i; ii < i+block; ii++) {
            for (jj = j; jj < j+block; jj++) {
                temp1[ii][jj] = A2[ii][jj];
                temp2[ii][jj] = A2[jj][ii];
                A2[ii][jj] = temp1[ii][jj];
                A2[ii][jj] = temp2[ii][jj];
            }
        }
    }
}
temp1 and temp2 are two matrices of size block x block filled with zeros.
I am not sure if I have to do another for when I am returning the values to A2 (the before and after transposed matrix).
I also tried this:
for (i = 0; i < N; i += block) {
    for (j = 0; j < N; j += block) {
        ii = A2[i][j];
        jj = A2[j][i];
        A2[j][i] = ii;
        A2[i][j] = jj;
    }
}
I am expecting better performance than the "naive" Matrix Transposition algorithm:
for (i = 1; i < N; i++) {
    for (j = 0; j < i; j++) {
        TEMP = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = TEMP;
    }
}
The proper way to do a blocked matrix transposition is not what is in your program. The extra temp1 and temp2 arrays will uselessly fill your cache, and your second version is incorrect. Moreover, you do too many operations: elements are transposed twice, and diagonal elements are "transposed" onto themselves.
But first we can do some simple (and approximate) cache behavior analysis. I assume that you have a matrix of doubles and that cache lines are 64 bytes (8 doubles).
A blocked implementation is equivalent to a naive implementation if the cache can completely contain the matrix. You only have mandatory cache misses to fetch the matrix elements. The number of cache misses will be N×N/8 to process N×N elements, with an average number of misses of 1/8 per element.
Now, for the naive implementation, look at the situation after you have processed 1 line in the cache. Assuming your cache is large enough, you will have in your cache:
* the complete line A[0][i]
* the first 8 elements of every other line of the matrix, A[i][0..7]
This means that, if your cache is large enough, you can process the 7 successive lines without any cache miss other than the ones to fetch the lines. So if your matrix is N×N and the cache size is larger than ~2×N×8, you will have only 8×N/8 (lines) + N (columns) = 2N cache misses to process 8×N elements, and the average number of misses per element is 1/4. Numerically, if the L1 cache size is 32k, this will happen if N<2k. And if the L2 cache is 256k, data will remain in L2 if N<16k. I do not think the difference between data in L1 and data in L2 will be really visible, thanks to the very efficient prefetch in modern processors.
If you have a very large matrix, after the end of the first line, the beginning of the second line has been evicted from the cache. This will happen if a line of your matrix completely fills the cache. In this situation, the number of cache misses is much larger. Every line will incur N/8 (to fetch the line) + N (to fetch the first elements of the columns) cache misses, so there is an average of (9×N/8)/N ≈ 1 miss per element.
So you can gain with a blocked implementation, but only for large matrices.
Here is a correct implementation of matrix transposition. It avoids processing element A[l][m] twice (once when i=l and j=m, and again when i=m and j=l), does not touch the diagonal elements, and uses a register for the swap.
Naive version
for (i = 0; i < N; i++)
    for (j = i+1; j < N; j++)
    {
        temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
And the blocked version (we assume the matrix size is a multiple of block size)
for (ii = 0; ii < N; ii += block)
    for (jj = 0; jj < N; jj += block)
        for (i = ii; i < ii+block; i++)
            for (j = jj+i+1; j < jj+block; j++)
            {
                temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
I am using your code but I am not getting the same answer when I compare the naive with the blocked algorithm. I put this matrix A and I am getting the matrix At as follows:
A
2 8 1 8
6 8 2 4
7 2 6 5
6 8 6 5
At
2 6 1 6
8 8 2 4
7 2 6 5
8 8 6 5
with a matrix of size N=4 and block= 2
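The mismatch comes from the inner loop bound in the blocked version above: starting j at jj+i+1 only makes sense inside a diagonal block (ii == jj), so off-diagonal blocks are only partially swapped, which is exactly what your 4×4 example shows. A sketch of a fix, keeping the same variables and still assuming N is a multiple of block: visit only blocks on or above the diagonal, and within a diagonal block swap only its upper triangle, so each pair is swapped exactly once.

for (ii = 0; ii < N; ii += block)
    for (jj = ii; jj < N; jj += block)                    /* only blocks on or above the diagonal */
        for (i = ii; i < ii+block; i++)
            for (j = (ii == jj ? i+1 : jj); j < jj+block; j++)   /* upper triangle inside a diagonal block */
            {
                temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }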

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment: compare two matrix multiplications - one done the default way, and one where the second matrix is transposed first - and point out which method is faster. I've written something like the code below, but time and time2 are nearly equal to each other. When I run the multiplication with the same matrix size, in one run the first method is faster and in another run the second one is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
        sum = 0;
        for (int k=0; k<size; ++k) {
            sum = sum + (m1[i][k] * m2[k][j]);
        }
        score[i][j] = sum;
    }
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;

for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
        int temp = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temp;
    }
}

clock_t start2 = clock();
int sum2;
for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
        sum2 = 0;
        for (int k=0; k<size; ++k) {
            sum2 = sum2 + (m1[k][i] * m2[k][j]);
        }
        score[i][j] = sum2;
    }
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but resolution may be smaller (i.e., the clocks change by more than 1 nanosecond, whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>

static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;

void timing_start(void)
{
    clock_gettime(CLOCK_REALTIME, &wall_start);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}

void timing_stop(void)
{
    struct timespec cpu_end, wall_end;
    clock_gettime(CLOCK_REALTIME, &wall_end);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
    wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
                 + (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
                + (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
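As a minimal usage sketch (matrix_multiply() here is just a placeholder for whatever operation you are measuring, and <stdio.h> is assumed for printf):

timing_start();
matrix_multiply(m1, m2, score, size);   /* placeholder for the code under test */
timing_stop();
printf("CPU: %.9f s, wall clock: %.9f s\n", cpu_seconds, wall_seconds);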
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. It prefers to develop their own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
        + a[r][1] * b[1][c]
        :    :
        + a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
        + a[r][1] * B[c][1]
        :    :
        + a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
    For c in columns:
        temporary = b[r][c]
        b[r][c] = b[c][r]
        b[c][r] = temporary
    End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2]       b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2]   ⇔   b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2]       b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
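A minimal sketch of such a transpose, using the question's m2 and size (swap only where the column index is greater than the row index, and leave the diagonal alone):

for (int r = 0; r < size; r++) {
    for (int c = r + 1; c < size; c++) {   /* upper triangle only: each pair swapped exactly once */
        int temporary = m2[r][c];
        m2[r][c] = m2[c][r];
        m2[c][r] = temporary;
    }
}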
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them back to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop accesses items far from each other in memory, it is anti-optimized: it uses the pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

How to use AVX/SIMD with nested loops and += format?

I am writing a PageRank program, and I am writing a method for updating the rankings. I have successfully got it working with nested for loops, and also a threaded version. However, I would like to instead use SIMD/AVX.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX, or how to get my data - which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row-major order) - into a format which could be used with SIMD/AVX operations.
As far as AVX goes, this is what I have so far, though it's very, very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_hat + i + j);
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(&res + j, pm);
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t i = mod; i < npages; i++) {
        for (size_t j = 0; j < npages; j++) {
            sum += P[j] * matrix_cap[IDX(i,j)];
        }
    }
    temp[i] = sum;
    sum = 0.0;
}
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
Consider using an OpenMP 4.x #pragma omp simd reduction for the innermost loop. Bear in mind that omp reductions are not applicable to C++ arrays, so you have to use a temporary reduction variable as shown below.
#define IDX(a, b) ((a * npages) + b)   // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0;   // was: temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP4.x is currently supported by fresh GCC (4.9+) and Intel Compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" means vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely here, due to the array element acting as the reduction variable). However, strictly speaking it is incorrect for the compiler to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side-effect) disambiguate the pointers (in case the arrays are accessed via pointers).
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i+=8) {
__m256d s0 = _mm256_setzero_pd(),
s1 = _mm256_setzero_pd(),
s2 = _mm256_setzero_pd(),
...
s7 = _mm256_setzero_pd(); // 8 accumulators for 8 i values
for (size_t j = 0; j < (npages & ~(3ULL)); j+=4) {
__m256d Pj = _mm256_loadu_pd(P+j); // reused 8 times after loading
//temp[i] += P[j] * matrix_cap[IDX(i,j)];
s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
// ...
s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
}
// or do this block with a hsum+transpose and do vector stores.
// taking advantage of the power of vhaddpd to be doing 4 useful hsums with each instructions.
temp[i+0] = hsum_pd256(s0); // See the horizontal-sum link earlier for how to write this function
temp[i+1] = hsum_pd256(s1);
//...
temp[i+7] = hsum_pd256(s7);
// if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor-of-2 smaller memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.
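A hedged sketch of that striped layout for one group of 4 rows (the repacked stripe[] array and the helper name are hypothetical, not from the original code; the repacking itself would be done once, outside the PageRank iteration loop). Each group of 4 consecutive doubles in stripe[] holds matrix values for rows i0..i0+3 at one j, so one FMA per j updates 4 running sums and no horizontal sum is needed:

#include <stddef.h>
#include <immintrin.h>

/* stripe points at the repacked block for rows i0..i0+3:
   stripe[4*j + l] == matrix_cap[IDX(i0 + l, j)] */
static void rank4_striped(const double *stripe, const double *P,
                          double *temp, size_t i0, size_t npages)
{
    __m256d acc = _mm256_setzero_pd();               // acc[l] accumulates temp[i0+l]
    for (size_t j = 0; j < npages; j++) {
        __m256d m = _mm256_loadu_pd(stripe + 4*j);   // matrix values for i0..i0+3 at this j
        __m256d p = _mm256_broadcast_sd(P + j);      // P[j] splatted to all 4 lanes
        acc = _mm256_fmadd_pd(p, m, acc);            // requires FMA (compile with -mfma)
    }
    _mm256_storeu_pd(temp + i0, acc);                // 4 results, no horizontal sum needed
}

In practice you would still unroll this over 2 or 3 such stripes (or use multiple accumulators per stripe) to hide FMA latency, as discussed above.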

Is this a proper for-loop optimization

For a homework assignment I need to optimize a loop to run in under 7.5 seconds. I think I may have done this, because my code runs in 4 seconds. However, my instructor told us that anything too far under 7.5 seconds is probably wrong, so I am worried that I might not be doing things correctly. Here is the original code:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES    600000
#define ARRAY_SIZE 10000

int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }
    return 0;
}
Here is my optimization:
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE/2; j += 20) {
        sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7] + array[j+8] + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14] + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;
Are these doing the same number of arithmetic operations? Did I change the code somehow or am I just optimizing well?
Your optimizations are not correct:
for (j = 0; j < ARRAY_SIZE/2; j += 20) {
You now loop half as many times in the inner loop as you should.
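A corrected sketch of the same idea (two accumulators, but covering the whole array; it still assumes ARRAY_SIZE is a multiple of 20, which holds for 10000):

double sum = 0, sum1 = 0;
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE; j += 20) {   /* was ARRAY_SIZE/2 */
        sum  += array[j]    + array[j+1]  + array[j+2]  + array[j+3]  + array[j+4]
              + array[j+5]  + array[j+6]  + array[j+7]  + array[j+8]  + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14]
              + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;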
It could be optimized in two ways: one is to improve the algorithm, the other is to improve it at the instruction level, i.e. doing every operation as fast as you can. Looking at your code, it seems you're trying to achieve the second one, and you're doing it quite rightly. One of the features found in modern processors is "instruction pipelining", which has a few stages. The order of code execution is:
IF Instruction Fetch
ID Instruction Decode
EX Execution
Mem Memory access
WB Write Back
These stages can be overlapped, i.e. while you're doing ID for one op, you can do IF for the next op in advance. In the first technique,
sum += array[j];
in this implementation, IF has to wait for the previous operation to execute completely, which results in stalled CPU cycles. IF, ID, EX, Mem and WB each take 1 CPU cycle, so it takes 5 CPU cycles to complete the full instruction. But with loop unrolling,
sum += array[j]; // first op
sum += array[j+1]; // second op
sum += array[j+2];
sum += array[j+3];
sum += array[j+4]; // fifth op
in this implementation, while the first op is in ID, the IF of the second op can happen on the same cycle, i.e. simultaneously. On the second CPU cycle you're doing ID of the first operation and IF of the second; on the third cycle you have IF of the third op, ID of the second op and EX of the first op. This utilizes instruction-level parallelism and reduces the number of stalled CPU cycles.
Based on this technique, a typical way of optimizing a loop is "unrolling" it, i.e. loop unrolling; you can get a full schematic view and details of loop unrolling and instruction pipelining in this link.
To get proof of what I tried to explain, let's have a test. I compiled your code and created two executables with the two different loops, then used perf to get an idea of how things went; the following are the results:
Performance counter stats for './test':
17739.862565 task-clock # 1.000 CPUs utilized
183 context-switches # 0.010 K/sec
5 cpu-migrations # 0.000 K/sec
138 page-faults # 0.008 K/sec
===> 58,408,599,809 cycles # 3.293 GHz
===> 34,387,134,201 stalled-cycles-frontend # 58.87% frontend cycles idle
===> 4,229,714,038 stalled-cycles-backend # 7.24% backend cycles idle
72,056,092,464 instructions # 1.23 insns per cycle
# 0.48 stalled cycles per insn
6,011,271,479 branches # 338.857 M/sec
618,206 branch-misses # 0.01% of all branches
17.744254427 seconds time elapsed
and now with unroll-loop-test:
Performance counter stats for './unroll-loop-test':
2395.115499 task-clock # 1.000 CPUs utilized
22 context-switches # 0.009 K/sec
2 cpu-migrations # 0.001 K/sec
138 page-faults # 0.058 K/sec
====> 7,885,935,372 cycles # 3.293 GHz
====> 1,569,263,256 stalled-cycles-frontend # 19.90% frontend cycles idle
====> 50,629,264 stalled-cycles-backend # 0.64% backend cycles idle
24,911,629,893 instructions # 3.16 insns per cycle
# 0.06 stalled cycles per insn
153,158,495 branches # 63.946 M/sec
607,999 branch-misses # 0.40% of all branches
2.395806562 seconds time elapsed
Take a close look at the number of cycles executed: with the unrolled loop there are far fewer stalled cycles, so fewer CPU cycles are required; without unrolling, the stalled cycles consume more CPU cycles and give poorer performance. So, yes, you're doing quite a nice optimization, and both versions execute the same number of arithmetic operations. Also remember that if you're running this program on a multiprocessor system, another level of optimization would be to split the whole program into a few parts and assign each part to a CPU available on the system; that is known as "parallel programming". I hope my answer clarifies the concept.
you can do:
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
sum *= N_TIMES;
return 0;
but this reduces the number of operations... To keep the same number of operations, the following will still get cache hits, and even register hits:
int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    int j;
    double d;
    for (j = 0; j < ARRAY_SIZE; j++) {
        d = array[j];
        for (i = 0; i < N_TIMES; i++) {
            sum += d;
        }
    }
    return 0;
}
Calloc sets all elements in the array to zero (actually all bits are set to zero). So you are really adding zero a bunch of times.
So let me run through some ways to potentially go faster, beyond what you are doing (which is good, you are avoiding comparisons, though if your array size wasn't a multiple of 20 or whatever, you would have issues).
It may or may not be slightly faster to initialize your array statically and set the values to zero:
double array[ARRAY_SIZE] = {0};
Technically {} should work, but {0} is probably more explicit.
a for loop will reinitialize j to 0 every time. Declare int j outside of both loops and you probably save ARRAY_SIZE operations.
In general if the numbers in an array follow some arithmetical sequence, it may be possible to reduce the loop into an equation.
For example Carl Friedrich Gauss supposedly figured out as a child that if your sequence is 1 2 3 4 .. n (n is the last number) then the sum is (n * (n + 1)) / 2
If n is 4, 1 + 2 + 3 + 4 = 10 and (4 * 5) /2 also equals ten.
There are other sequences, like the sum of consecutive squared numbers IE (1^2 + 2^2 + 3^2 + 4^2.. n^2). Read https://en.wikipedia.org/wiki/Square_pyramidal_number for more on that.
Anyway my point is understanding math is important to optimization.
In your case all your numbers are the same, which means you could just multiply the number by ARRAY_SIZE and N_TIMES. The only time where this would maybe give a different answer is where you would overflow the max size of a double. Further, they are all 0, so that you don't ever have to do that;
You could potentially do something like:
int i, j;
double d = 0;
for (i = 0; i < ARRAY_SIZE; i++) {
    if (array[i] == 0)
        continue;
    for (j = 0; j < N_TIMES; j++) {
        d += array[i];
    }
}
That's untested, because I doubt it would be acceptable, but the pattern, skipping to the next iteration of the loop in a common case to avoid subsequent unnecessary instructions, is a common optimization practice.
In an unoptimized compiler, using a pointer may be faster than an index.
IE, you could loop with:
double *arrayend = array + (ARRAY_SIZE - 1);
double *valuep;
for (valuep = array; valuep <= arrayend; valuep++) {
    //inner stuff
}
!= may be faster than <, though don't use equality for comparing non-integers.
Using unsigned numbers MAY be faster, though probably not in your case. Signed vs Unsigned operations in C
Integers are probably faster than doubles, but may not be big enough for actual math.
Edit: one more thing. If you know the cache size of the system, you can potentially optimize for that.
