How to use AVX/SIMD with nested loops and += format? - c

I am writing a page rank program, and I am writing a method for updating the rankings. I have successfully got it working with nested for loops, and also as a threaded version. However, I would like to instead use SIMD/AVX.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.
I don't know how to express += with AVX, or how to get my data, which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row-major order), into a format that could be used with SIMD/AVX operations.
As for AVX, this is what I have so far, though it's very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
    for (size_t j = 0; j < g_mod; j += 4) {
        __m256d p = _mm256_loadu_pd(P + j);
        __m256d m = _mm256_loadu_pd(matrix_hat + i + j);
        __m256d pm = _mm256_mul_pd(p, m);
        _mm256_storeu_pd(res + j, pm);
        for (size_t k = 0; k < 4; k++) {
            sum += res[j + k];
        }
    }
    for (size_t i = g_mod; i < npages; i++) {
        for (size_t j = 0; j < npages; j++) {
            sum += P[j] * matrix_cap[IDX(i,j)];
        }
    }
    temp[i] = sum;
    sum = 0.0;
}
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?

Consider using OpenMP 4.x's #pragma omp simd reduction for the innermost loop. Keep in mind that OpenMP reductions are not applicable to C++ arrays, therefore you have to use a temporary reduction variable as shown below.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0; // was: temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP4.x is currently supported by fresh GCC (4.9+) and Intel Compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" meaning vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely due to the array element being used as the reduction variable). However, strictly speaking it is incorrect to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side effect) disambiguate the pointers (in case the arrays are accessed via pointer).

First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
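For reference, the hsum_pd256() helper used in the code below isn't defined anywhere in this post; here is a minimal sketch of one common way to write a horizontal sum of a __m256d with a couple of shuffles and adds (requires <immintrin.h>):
static inline double hsum_pd256(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);        // low 128-bit half
    __m128d hi = _mm256_extractf128_pd(v, 1);      // high 128-bit half
    lo = _mm_add_pd(lo, hi);                       // add the two halves
    __m128d high64 = _mm_unpackhi_pd(lo, lo);      // move upper element down
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64));  // final scalar sum
}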
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing
for (size_t i = 0; i < (npages & (~7ULL)); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P + j);   // reused 8 times after loading
        // temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores,
    // taking advantage of the power of vhaddpd to do 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0);   // see the horizontal-sum sketch earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    // ...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
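As a rough illustration of that variant (assuming the same surrounding loop over i as above, and that the compiler actually keeps sums[] in registers, which you'd have to verify in the asm):
__m256d sums[8];
for (int k = 0; k < 8; k++)
    sums[k] = _mm256_setzero_pd();
for (size_t j = 0; j < (npages & ~3ULL); j += 4) {
    __m256d Pj = _mm256_loadu_pd(P + j);
    for (int k = 0; k < 8; k++)   // hopefully fully unrolled by the compiler
        sums[k] = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i + k, j)]), sums[k]);
}
for (int k = 0; k < 8; k++)
    temp[i + k] = hsum_pd256(sums[k]);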
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimise it, as it will be called a lot?
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor of 2 less memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.
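A rough sketch of that inner loop, for one group of 4 i values (the matrix_striped name and index formula are illustrative assumptions: element (i, j) is stored at (i/4)*npages*4 + j*4 + (i%4), so 4 consecutive i values for the same j are contiguous; acc is one accumulator, starting at _mm256_setzero_pd(), and the striped array is assumed 32-byte aligned):
for (size_t j = 0; j < npages; j++) {
    __m256d Pj = _mm256_broadcast_sd(&P[j]);   // one P[j] in all 4 lanes
    __m256d m  = _mm256_load_pd(&matrix_striped[(i/4) * npages * 4 + j * 4]);
    acc = _mm256_fmadd_pd(Pj, m, acc);         // each lane accumulates a different i
}
_mm256_storeu_pd(&temp[i], acc);               // temp[i+0..3], no horizontal sum needed
With multiple accumulators you'd do this for 8 groups of j at a time, exactly as in the earlier code, but with plain vector stores instead of hsums at the end.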

Related

Tiled Matrix Multiplication using AVX

I have coded the following C function for multiplying two NxN matrices using tiling/blocking and AVX vectors to speed up the calculation. Right now though I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?
Also, is there a better memory access pattern for matrix B? Maybe transposing it first or even changing the k and j loop? Because right now, I'm traversing it column-wise which is probably not very efficient in regards to spatial locality and cache lines.
void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
{
    int i, j, k, i0, j0, k0;
    // double sum;
    __m256d sum;
    for (i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
        for (k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
            for (j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
                for (i = i0; i < MIN(i0+BLOCKSIZE, SIZE_M); i++) {
                    for (j = j0; j < MIN(j0+BLOCKSIZE, SIZE_K); j++) {
                        // sum = C[i][j];
                        sum = _mm256_load_pd(&C[i][j]);
                        for (k = k0; k < MIN(k0+BLOCKSIZE, SIZE_N); k++) {
                            // sum += A[i][k] * B[k][j];
                            sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
                        }
                        // C[i][j] = sum;
                        _mm256_store_pd(&C[i][j], sum);
                    }
                }
            }
        }
    }
}
_mm256_load_pd is an alignment-required load but you're only stepping by k++, not k+=4 in the inner-most loop that loads a 32-byte vector of 4 doubles. So it faults because 3 of every 4 loads are misaligned.
You don't want to be doing overlapping loads; your real bug is the indexing. If your input pointers are 32-byte aligned, you should be able to keep using _mm256_load_pd instead of _mm256_loadu_pd. So using _mm256_load_pd successfully caught your bug, instead of working but giving numerically wrong results.
Your strategy for vectorizing four row*column dot products (to produce a C[i][j+0..3] vector) should load 4 contiguous doubles from 4 different columns (B[k][j+0..3] via a vector load from B[k][j]), and broadcast 1 double from A[i][k]. Remember you want 4 dot products in parallel.
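A minimal sketch of what the innermost loop could look like under that strategy (illustrative only, not a drop-in fix for the whole function; note the j loop would then have to step by 4):
__m256d sum = _mm256_loadu_pd(&C[i][j]);            // C[i][j+0..3]
for (k = k0; k < MIN(k0 + BLOCKSIZE, SIZE_N); k++) {
    __m256d b = _mm256_loadu_pd(&B[k][j]);          // 4 contiguous doubles from row k of B
    __m256d a = _mm256_broadcast_sd(&A[i][k]);      // one element of A in all 4 lanes
    sum = _mm256_fmadd_pd(a, b, sum);               // or mul + add without FMA
}
_mm256_storeu_pd(&C[i][j], sum);                    // 4 partial dot products for C[i][j+0..3]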
Another strategy might involve a horizontal sum at the end down to a scalar C[i][j] += horizontal_add(__m256d), but I think that would require transposing one input first so both row and column vectors are in contiguous memory for one dot product. But then you need shuffles for a horizontal sum at the end of each inner loop.
You probably also want to use at least 2 sum variables so you can read a whole cache line at once, and hide FMA latency in the inner loop and hopefully bottleneck on throughput. Or better do 4 or 8 vectors in parallel. So you produce C[i][j+0..15] as sum0, sum1, sum2, sum3. (Or use an array of __m256d; compilers will typically fully unroll a loop of 8 and optimize the array into registers.)
I think you only need 5 nested loops, to block over rows and columns. Although apparently 6 nested loops are a valid option: see loop tiling/blocking for large dense matrix multiplication which has a 5-nested loop in the question but a 6-nested loop in an answer. (Just scalar, though, not vectorized).
There might be other bugs besides the row*column dot product strategy here, I'm not sure.
If you're using AVX, you might want to use FMA as well, unless you need to run on Sandybridge/Ivybridge and AMD Bulldozer. (Piledriver and later have FMA3).
Other matmul strategies include adding into the destination inside the inner loop so you're loading C and A inside the inner loop, with a load from B hoisted. (Or B and A swapped, I forget.) What Every Programmer Should Know About Memory? has a vectorized cache-blocked example that works this way in an appendix, for SSE2 __m128d vectors. https://www.akkadia.org/drepper/cpumemory.pdf
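As a rough, naive (unblocked) illustration of that kind of loop structure, with the broadcast of one A element hoisted out of the innermost loop (the roles of A and B can be swapped depending on which matrix you broadcast from; assumes SIZE_K is a multiple of 4):
for (int i = 0; i < SIZE_M; i++) {
    for (int k = 0; k < SIZE_N; k++) {
        __m256d a = _mm256_broadcast_sd(&A[i][k]);      // hoisted broadcast
        for (int j = 0; j < SIZE_K; j += 4) {
            __m256d c = _mm256_loadu_pd(&C[i][j]);      // load destination
            c = _mm256_fmadd_pd(a, _mm256_loadu_pd(&B[k][j]), c);
            _mm256_storeu_pd(&C[i][j], c);              // accumulate into C
        }
    }
}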

Is vectorization profitable in this case?

I broke a kernel down into several loops, in order to vectorize each one of them afterwards. One of these loops looks like:
int *array1; // Its size is "size+1";
int *array2; // Its size is "size+1";
// All positions of array1 and array2 are set to 0 here;
int *sArray1 = array1+1; // Shift one position so I start writing on pos 1
int *sArray2 = array2+1; // Shift one position so I start writing on pos 1
int bb = 0;
for (int i = 0; i < size; i++) {
    if (A[i] + bb > B[i]) {
        bb = 1;
        sArray1[i] = S;
        sArray2[i] = 1;
    }
    else
        bb = 0;
}
Please note the loop-carried dependency in bb: each comparison depends upon bb's value, which is modified in the previous iteration.
What I thought about:
I can be absolutely certain of some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries from the previous iteration.
When A[i] equals B[i], I need to know what value bb carries from the previous iteration. However, I also need to account for the case when this happens in two consecutive positions. When I started to shape up these cases, it seemed that they become overly complicated and vectorization doesn't pay off.
Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.
You might not want to iterate over single elements, but have a loop over the chunks (where a chunk is defined by all elements within it yielding the same bb).
The search for chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics).
And the action to be taken for a single chunk of bb=1 could be vectorized, too.
The loop transformation is as follows:
size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;
while (i_chunk_start < isize) {
    if (bb_chunk) {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < isize; ++i_chunk_end) {
            if (A[i_chunk_end] < B[i_chunk_end]) {
                break;
            }
        }
        /* process current chunk */
        for (size_t i = i_chunk_start; i < i_chunk_end; ++i) {
            sArray1[i] = S;
            sArray2[i] = 1;
        }
        bb_chunk = 0;
    } else {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < isize; ++i_chunk_end) {
            if (A[i_chunk_end] > B[i_chunk_end]) {
                break;
            }
        }
        bb_chunk = 1;
    }
    /* prepare for next chunk */
    i_chunk_start = i_chunk_end;
}
Now, each of the inner loops (all for loops) could potentially get vectorized.
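For example, the "process current chunk" loop for a bb=1 chunk could be filled with SSE2 stores, 4 ints at a time, with a scalar loop handling the tail (a rough sketch, requires <emmintrin.h>):
__m128i vS   = _mm_set1_epi32(S);
__m128i vOne = _mm_set1_epi32(1);
size_t i = i_chunk_start;
for (; i + 4 <= i_chunk_end; i += 4) {
    _mm_storeu_si128((__m128i*)&sArray1[i], vS);
    _mm_storeu_si128((__m128i*)&sArray2[i], vOne);
}
for (; i < i_chunk_end; ++i) { sArray1[i] = S; sArray2[i] = 1; }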
Whether or not vectorization in this manner is superior to non-vectorization depends on whether the chunks have sufficient length on average. You will only find out by benchmarking.
The effect of your loop body depends on two conditions:
A[i] > B[i]
A[i] + 1 > B[i]
Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).
You can harvest those bits from an SSE register with _mm_movemask_epi8. It's a bit inconvenient that it works on bytes and not on ints, but you can take care of that with a suitable shuffle.
Afterwards, use the 8 bits as an address to a LUT (of 256 entries), which stores 4-bit masks. These masks can be used to store the elements into destination conditionally, using _mm_maskmoveu_si128.
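A rough sketch of computing the two condition masks for 4 ints at a time (illustrative only; this uses _mm_movemask_ps on a cast result as one way to get one bit per int, instead of the byte movemask plus shuffle mentioned above, and it ignores possible INT_MAX overflow in A[i]+1; the LUT and masked store are not shown):
__m128i a  = _mm_loadu_si128((const __m128i*)&A[i]);
__m128i b  = _mm_loadu_si128((const __m128i*)&B[i]);
__m128i c1 = _mm_cmpgt_epi32(a, b);                                     // A[i] > B[i]
__m128i c2 = _mm_cmpgt_epi32(_mm_add_epi32(a, _mm_set1_epi32(1)), b);   // A[i] + 1 > B[i]
int bits1 = _mm_movemask_ps(_mm_castsi128_ps(c1));                      // 4 bits
int bits2 = _mm_movemask_ps(_mm_castsi128_ps(c2));                      // 4 bits
int lut_index = (bits2 << 4) | bits1;                                   // 8-bit LUT index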
I am not sure such a complicated program is worthwhile - it involves much bit-fiddling for just a 4x improvement in speed. Maybe it's better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.

What memory access patterns are most efficient for outer-product-type double loops?

What access patterns are most efficient for writing cache-efficient, outer-product-type code that maximally exploits data locality?
Consider a block of code for processing all pairs of elements of two arrays such as:
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        out[i*M + j] = X[i] binary-op Y[j];
This is a standard vector-vector outer product when binary-op is scalar multiplication and X and Y are 1d, but this same pattern is also matrix multiplication when X and Y are matrices and binary-op is a dot product between the ith row and j-th column of two matrices.
For matrix multiplication, I know optimized BLASs like OpenBLAS and MKL can get much higher performance than you get from the double loop style code above, because they process the elements in chunks in such a way as to exploit the CPU cache much more. Unfortunately, OpenBLAS kernels are written in assembly so it's pretty difficult to figure out what's going on.
Are there any good "tricks of the trade" for re-organizing these types of double loops to improve cache performance?
Since each element of out is only hit once, we're clearly free to reorder the iterations. The straight linear traversal of out is the easiest to write, but I don't think it's the most efficient pattern to execute, since you don't exploit any locality in X.
I'm especially interested in the setting where M and N are large, and the size of each element (X[i] and Y[j]) is pretty small (like O(1) bytes), so we're talking about something analogous to a vector-vector outer product, or the multiplication of a tall and skinny matrix by a short and fat matrix (e.g. N x D by D x M where D is small).
For large enough M, the Y vector will exceed the L1 cache size.* Thus on every new outer iteration, you'll be reloading Y from main memory (or at least, a slower cache). In other words, you won't be exploiting temporal locality in Y.
You should block up your accesses to Y; something like this:
for (jj = 0; jj < M; jj += CACHE_SIZE) {            // Iterate over blocks
    for (i = 0; i < N; i++) {
        for (j = jj; j < (jj + CACHE_SIZE); j++) {   // Iterate within block
            out[i*M + j] = X[i] * Y[j];
        }
    }
}
The above doesn't do anything smart with accesses to X, but new values are only being accessed 1/CACHE_SIZE as often, so the impact is probably negligible.
* If everything is small enough to already fit in cache, then you can't do better than what you already have (vectorisation opportunities notwithstanding).
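For completeness, a variant of the blocked loop above with a bounds guard, in case M is not a multiple of the block size (here CACHE_SIZE is assumed to be a count of elements, and MIN is the usual min macro; names are illustrative):
#define MIN(a, b) ((a) < (b) ? (a) : (b))
for (int jj = 0; jj < M; jj += CACHE_SIZE) {
    int jend = MIN(jj + CACHE_SIZE, M);
    for (int i = 0; i < N; i++)
        for (int j = jj; j < jend; j++)
            out[i*M + j] = X[i] * Y[j];
}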

Is this a proper for-loop optimization

For a homework assignment I need to optimize a loop to run in under 7.5 seconds. I think I may have done this because my code runs in 4 seconds. However, I am worried I am not doing it correctly because my instructor told us that anything too far under 7.5 seconds is probably wrong. So I am worried that I might not be doing things correctly. Here is the original code:
#include <stdio.h>
#include <stdlib.h>

#define N_TIMES 600000
#define ARRAY_SIZE 10000

int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }
    return 0;
}
Here is my optimization:
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE/2; j += 20) {
        sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7] + array[j+8] + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14] + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;
Are these doing the same number of arithmetic operations? Did I change the code somehow or am I just optimizing well?
Your optimizations are not correct:
for (j = 0; j < ARRAY_SIZE/2; j += 20) {
You now loop half as many times in the inner loop as you should.
It could be optimized in two ways: one is to improve the algorithm, the other is to improve it at the instruction level, i.e. doing every operation as fast as you can. By looking at your code, it seems you're trying to achieve the second one, and you're doing it quite rightly. One of the features found in modern processors is "instruction pipelining", which has several stages. The order of code execution is -
IF Instruction Fetch
ID Instruction Decode
EX Execution
Mem Memory access
WB Write Back
These ops could be done in parallel, i.e. while you're doing ID for one op, you can do IF for the next op in advance. In the first technique,
sum += array[j];
in this implementation, IF has to wait for the previous operation to be executed completely, which results in stalled CPU cycles. IF, ID, EX, Mem, and WB each take 1 CPU cycle, therefore 5 CPU cycles to complete the full instruction. But with loop unrolling,
sum += array[j]; // first op
sum += array[j+1]; // second op
sum += array[j+2];
sum += array[j+3];
sum += array[j+4]; // fifth op
in this implementation, while the first op is in ID, IF of the second op can happen in the same cycle, i.e. simultaneously. On the second CPU cycle, you're doing ID of the first operation and IF of the second operation; on the 3rd cycle, you have IF of the third op, ID of the second op and EX of the first op. It therefore exploits instruction-level parallelism and reduces the number of stalled CPU cycles.
Based on this technique, a typical way of optimizing a loop is "unrolling" it, i.e. loop unrolling. You can get a full schematic view and details of loop unrolling and instruction pipelining in this link.
To get proof of what I tried to explain, let's have a test. I compiled your code and created two executables with the two different loops. I used perf to get an idea of how things went; the following are the results:
Performance counter stats for './test':
17739.862565 task-clock # 1.000 CPUs utilized
183 context-switches # 0.010 K/sec
5 cpu-migrations # 0.000 K/sec
138 page-faults # 0.008 K/sec
===> 58,408,599,809 cycles # 3.293 GHz
===> 34,387,134,201 stalled-cycles-frontend # 58.87% frontend cycles idle
===> 4,229,714,038 stalled-cycles-backend # 7.24% backend cycles idle
72,056,092,464 instructions # 1.23 insns per cycle
# 0.48 stalled cycles per insn
6,011,271,479 branches # 338.857 M/sec
618,206 branch-misses # 0.01% of all branches
17.744254427 seconds time elapsed
and now with unroll-loop-test:
Performance counter stats for './unroll-loop-test':
2395.115499 task-clock # 1.000 CPUs utilized
22 context-switches # 0.009 K/sec
2 cpu-migrations # 0.001 K/sec
138 page-faults # 0.058 K/sec
====> 7,885,935,372 cycles # 3.293 GHz
====> 1,569,263,256 stalled-cycles-frontend # 19.90% frontend cycles idle
====> 50,629,264 stalled-cycles-backend # 0.64% backend cycles idle
24,911,629,893 instructions # 3.16 insns per cycle
# 0.06 stalled cycles per insn
153,158,495 branches # 63.946 M/sec
607,999 branch-misses # 0.40% of all branches
2.395806562 seconds time elapsed
Take a close look at the number of cycles executed: with the unrolled loop, there are far fewer stalled cycles, so it requires fewer CPU cycles; without unrolling, the stalled cycles consume more CPU cycles and result in poor performance. So yes, you're doing quite a nice optimization, and both versions execute the same number of arithmetic operations. But also remember that if you're running this program on a multiprocessor system, another level of optimization would be to split the whole program into a few parts and assign each part to a CPU available on the system; that is known as "parallel programming". Hope my answer clarifies the concept.
you can do:
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
sum *= N_TIMES;
return 0;
but this reduces the number of operations... To keep the same number of operations, the following version maintains cache hits, and even register hits.
int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    int j;
    double d;
    for (j = 0; j < ARRAY_SIZE; j++) {
        d = array[j];
        for (i = 0; i < N_TIMES; i++) {
            sum += d;
        }
    }
    return 0;
}
Calloc sets all elements in the array to zero (actually all bits are set to zero). So you are really adding zero a bunch of times.
So let me run through some ways to potentially go faster, beyond what you are doing (which is good, you are avoiding comparisons, though if your array size wasn't a multiple of 20 or whatever, you would have issues).
It may or may not be slightly faster to initialize your array statically and set the values to zero:
double array[ARRAY_SIZE] = {0};
Technically {} should work, but {0} is probably more explicit.
A for loop will reinitialize j to 0 every time. Declare int j outside of both loops and you probably save ARRAY_SIZE operations.
In general if the numbers in an array follow some arithmetical sequence, it may be possible to reduce the loop into an equation.
For example Carl Friedrich Gauss supposedly figured out as a child that if your sequence is 1 2 3 4 .. n (n is the last number) then the sum is (n * (n + 1)) / 2
If n is 4, 1 + 2 + 3 + 4 = 10 and (4 * 5) /2 also equals ten.
There are other sequences, like the sum of consecutive squared numbers IE (1^2 + 2^2 + 3^2 + 4^2.. n^2). Read https://en.wikipedia.org/wiki/Square_pyramidal_number for more on that.
Anyway my point is understanding math is important to optimization.
In your case all your numbers are the same, which means you could just multiply the number by ARRAY_SIZE and N_TIMES. The only time this would maybe give a different answer is where you would overflow the max size of a double. Further, they are all 0, so you don't ever even have to do that.
You could potentially do something like:
int i, j;
double d;
for (i = 0; i < ARRAY_SIZE; i++) {
    if (array[i] == 0)
        continue;
    for (j = 0; j < N_TIMES; j++) {
        d += array[i];
    }
}
That's untested, because I doubt it would be acceptable, but the pattern, skipping to the next iteration of the loop in a common case to avoid subsequent unnecessary instructions, is a common optimization practice.
In an unoptimized compiler, using a pointer may be faster than an index.
IE, you could loop with:
double *arrayend = array + (ARRAY_SIZE - 1);
double *valuep;
for (valuep = array; valuep <= arrayend; valuep++) {
    // inner stuff
}
!= may be faster than <, though don't use equality comparisons for non-integers.
Using unsigned numbers MAY be faster, though probably not in your case. Signed vs Unsigned operations in C
Integers are probably faster than doubles, but may not be big enough for actual math.
Edit: one more. If you know the cache size of the system, you can potentially optimize for that.

Improve C function performance with cache locality?

I have to find a diagonal difference in a matrix represented as a 2D array, and the function prototype is
int diagonal_diff(int x[512][512])
I have to use a 2d array, and the data is 512x512. This is tested on a SPARC machine: my current timing is 6ms but I need to be under 2ms.
Sample data:
[3][4][5][9]
[2][8][9][4]
[6][9][7][3]
[5][8][8][2]
The difference is:
|4-2| + |5-6| + |9-5| + |9-9| + |4-8| + |3-8| = 2 + 1 + 4 + 0 + 4 + 5 = 16
In order to do that, I use the following algorithm:
int i, j, result = 0;
for (i = 0; i < 4; i++)
    for (j = 0; j < 4; j++)
        result += abs(array[i][j] - array[j][i]);
return result;
But this algorithm keeps accessing column, row, column, row, etc., which makes inefficient use of the cache.
Is there a way to improve my function?
EDIT: Why is a block oriented approach faster? We are taking advantage of the CPU's data cache by ensuring that whether we iterate over a block by row or by column, we guarantee that the entire block fits into the cache.
For example, if you have a cache line of 32 bytes and an int is 4 bytes, you can fit an 8x8 int matrix into 8 cache lines. Assuming you have a big enough data cache, you can iterate over that matrix either by row or by column and be guaranteed that you do not thrash the cache. Another way to think about it is that if your matrix fits in the cache, you can traverse it any way you want.
If you have a matrix that is much bigger, say 512x512, then you need to tune your matrix traversal such that you don't thrash the cache. For example, if you traverse the matrix in the opposite order of the layout of the matrix, you will almost always miss the cache on every element you visit.
A block oriented approach ensures that you only have a cache miss for data you will eventually visit before the CPU has to flush that cache line. In other words, a block oriented approach tuned to the cache line size will ensure you don't thrash the cache.
So, if you are trying to optimize for the cache line size of the machine you are running on, you can iterate over the matrix in block form and ensure you only visit each matrix element once:
int sum_diagonal_difference(int array[512][512], int block_size)
{
    int i, j, block_i, block_j, result = 0;
    // sum the off-diagonal blocks (upper triangle of blocks)
    for (block_i = 0; block_i < 512; block_i += block_size)
        for (block_j = block_i + block_size; block_j < 512; block_j += block_size)
            for (i = 0; i < block_size; i++)
                for (j = 0; j < block_size; j++)
                    result += abs(array[block_i + i][block_j + j] - array[block_j + j][block_i + i]);
    result += result;
    // sum the diagonal blocks
    for (int block_offset = 0; block_offset < 512; block_offset += block_size)
    {
        for (i = 0; i < block_size; ++i)
        {
            for (j = i+1; j < block_size; ++j)
            {
                int value = abs(array[block_offset + i][block_offset + j] - array[block_offset + j][block_offset + i]);
                result += value + value;
            }
        }
    }
    return result;
}
You should experiment with various values for block_size. On my machine, 8 led to the biggest speed-up (2.5x) compared to a block_size of 1 (and ~5x compared to the original iteration over the entire matrix). The block_size should ideally be cache_line_size_in_bytes/sizeof(int).
If you have a good vector/matrix library like Intel MKL, also try the vectorized way.
Very simple in Matlab:
result = sum(sum(abs(x-x')));
I reproduced Hans's method and MSN's method in matlab too, and the results are:
Elapsed time is 0.211480 seconds. (Hans)
Elapsed time is 0.009172 seconds. (MSN)
Elapsed time is 0.002193 seconds. (Mine)
With one minor change you can have your loops only operate on the desired indices. I just changed the j loop initialization.
int i, j, result = 0;
for (i = 0; i < 4; ++i) {
    for (j = i + 1; j < 4; ++j) {
        result += abs(array[i][j] - array[j][i]);
    }
}
