Is this a proper for-loop optimization - c

For a homework assignment I need to optimize a loop to run in under 7.5 seconds. I think I may have done this, because my code runs in 4 seconds. However, my instructor told us that anything too far under 7.5 seconds is probably wrong, so I am worried that I am not doing this correctly. Here is the original code:
#include <stdio.h>
#include <stdlib.h>
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    for (i = 0; i < N_TIMES; i++) {
        int j;
        for (j = 0; j < ARRAY_SIZE; j++) {
            sum += array[j];
        }
    }
    return 0;
}
Here is my optimization:
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE/2; j += 20) {
        sum += array[j] + array[j+1] + array[j+2] + array[j+3] + array[j+4] + array[j+5] + array[j+6] + array[j+7] + array[j+8] + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14] + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;
Are these doing the same number of arithmetic operations? Did I change the code somehow or am I just optimizing well?

Your optimizations are not correct:
for (j = 0; j < ARRAY_SIZE/2; j += 20) {
Because of the ARRAY_SIZE/2 bound, the inner loop now covers only half the array, so you are doing half as many additions as you should.
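A minimal corrected version of your unrolled loop (assuming sum1 is declared as a double initialized to 0 alongside sum, which your snippet doesn't show) keeps the full ARRAY_SIZE bound:
double sum1 = 0;
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE; j += 20) {   /* ARRAY_SIZE is a multiple of 20, so no tail loop is needed */
        sum  += array[j]    + array[j+1]  + array[j+2]  + array[j+3]  + array[j+4]
              + array[j+5]  + array[j+6]  + array[j+7]  + array[j+8]  + array[j+9];
        sum1 += array[j+10] + array[j+11] + array[j+12] + array[j+13] + array[j+14]
              + array[j+15] + array[j+16] + array[j+17] + array[j+18] + array[j+19];
    }
}
sum += sum1;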

It could be optimized in two ways: one is to improve the algorithm, the other is to improve it at the instruction level, i.e. make every operation run as fast as you can. Looking at your code, it seems you're going for the second, and you're doing it quite rightly. One of the features found in modern processors is "instruction pipelining": an instruction goes through a few stages. The order of execution is:
IF Instruction Fetch
ID Instruction Decode
EX Execution
Mem Memory access
WB Write Back
These stages can be overlapped, i.e. while one instruction is in ID, the IF of the next instruction can already be done. In the first version,
sum += array[j];
each iteration's IF effectively waits for the previous operation to finish executing, which results in stalled CPU cycles. If IF, ID, EX, Mem and WB each take 1 CPU cycle, it takes 5 CPU cycles to complete a full instruction. But with loop unrolling,
sum += array[j]; // first op
sum += array[j+1]; // second op
sum += array[j+2];
sum += array[j+3];
sum += array[j+4]; // fifth op
in this implementation, while the first instruction is in ID, the IF of the second one can happen in the same cycle, i.e. simultaneously. On the second CPU cycle you're doing ID of the first operation and IF of the second; on the 3rd cycle you have IF of the third op, ID of the second op and EX of the first op. It therefore exploits instruction-level parallelism and reduces the number of stalled CPU cycles.
Based on this technique, a typical way of optimizing a loop is to "unroll" it, i.e. loop unrolling. You can get a full schematic view and details of loop unrolling and instruction pipelining in this link.
As evidence for what I tried to explain, let's run a test. I compiled your code and created two executables with the two different loops, then used perf to get an idea of how things went. The results follow:
Performance counter stats for './test':
17739.862565 task-clock # 1.000 CPUs utilized
183 context-switches # 0.010 K/sec
5 cpu-migrations # 0.000 K/sec
138 page-faults # 0.008 K/sec
===> 58,408,599,809 cycles # 3.293 GHz
===> 34,387,134,201 stalled-cycles-frontend # 58.87% frontend cycles idle
===> 4,229,714,038 stalled-cycles-backend # 7.24% backend cycles idle
72,056,092,464 instructions # 1.23 insns per cycle
# 0.48 stalled cycles per insn
6,011,271,479 branches # 338.857 M/sec
618,206 branch-misses # 0.01% of all branches
17.744254427 seconds time elapsed
and now with unroll-loop-test:
Performance counter stats for './unroll-loop-test':
2395.115499 task-clock # 1.000 CPUs utilized
22 context-switches # 0.009 K/sec
2 cpu-migrations # 0.001 K/sec
138 page-faults # 0.058 K/sec
====> 7,885,935,372 cycles # 3.293 GHz
====> 1,569,263,256 stalled-cycles-frontend # 19.90% frontend cycles idle
====> 50,629,264 stalled-cycles-backend # 0.64% backend cycles idle
24,911,629,893 instructions # 3.16 insns per cycle
# 0.06 stalled cycles per insn
153,158,495 branches # 63.946 M/sec
607,999 branch-misses # 0.40% of all branches
2.395806562 seconds time elapsed
Take a close look at the number of cycles executed. With the unrolled loop the stalled cycles are far fewer, so fewer CPU cycles are needed; without unrolling, the stalled cycles consume more CPU cycles and give poor performance. So yes, you're doing quite a nice optimization, and both versions execute the same number of arithmetic operations. Also remember that if you run this program on a multiprocessor system, another level of optimization would be to split the whole work into parts and assign each part to a CPU available on the system; that is known as "parallel programming". Hope my answer clarifies the concept.
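That last point is outside the scope of the assignment, but as a rough illustration (my addition, not part of the original answer), OpenMP can split the outer loop across cores with a single pragma; compile with -fopenmp:
#pragma omp parallel for reduction(+:sum)   /* each thread gets a private partial sum, combined at the end */
for (i = 0; i < N_TIMES; i++) {
    int j;
    for (j = 0; j < ARRAY_SIZE; j++) {
        sum += array[j];
    }
}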

you can do:
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
    sum += array[j];
}
sum *= N_TIMES;
return 0;
But this reduces the number of operations. To keep the same number of additions while still getting good cache behaviour (and even keeping the value in a register), you can hoist the array load out of the repeated loop:
int main (void)
{
    double *array = calloc(ARRAY_SIZE, sizeof(double));
    double sum = 0;
    int i;
    int j;
    double d;
    for (j = 0; j < ARRAY_SIZE; j++) {
        d = array[j];
        for (i = 0; i < N_TIMES; i++) {
            sum += d;
        }
    }
    return 0;
}

Calloc sets all elements in the array to zero (actually all bits are set to zero). So you are really adding zero a bunch of times.
So let me run through some ways to potentially go faster, beyond what you are doing (which is good, you are avoiding comparisons, though if your array size wasn't a multiple of 20 or whatever, you would have issues).
It may or may not be slightly faster to initialize your array statically and set the values to zero:
double array[ARRAY_SIZE] = {0};
Technically an empty initializer {} also works with many compilers (and officially in C23), but {0} is more explicit.
A for loop reinitializes j to 0 on every pass of the outer loop. Declaring int j outside of both loops (and hoisting whatever else you can) might save you up to N_TIMES operations.
In general if the numbers in an array follow some arithmetical sequence, it may be possible to reduce the loop into an equation.
For example Carl Friedrich Gauss supposedly figured out as a child that if your sequence is 1 2 3 4 .. n (n is the last number) then the sum is (n * (n + 1)) / 2
If n is 4, 1 + 2 + 3 + 4 = 10 and (4 * 5) /2 also equals ten.
There are other sequences, like the sum of consecutive squared numbers IE (1^2 + 2^2 + 3^2 + 4^2.. n^2). Read https://en.wikipedia.org/wiki/Square_pyramidal_number for more on that.
Anyway my point is understanding math is important to optimization.
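For instance, a quick sketch of the closed-form version (the value of n here is purely illustrative, not from the assignment):
long long n = 10000;
long long total = n * (n + 1) / 2;   /* same result as summing 1 + 2 + ... + n in a loop */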
In your case all your numbers are the same, which means you could just multiply one of them by ARRAY_SIZE and N_TIMES. The only time this could give a different answer is if the sum overflowed what a double can represent. Further, they are all 0, so you never even have to do that.
You could potentially do something like:
int i, j;
double d = 0;   /* must be initialized before accumulating into it */
for (i = 0; i < ARRAY_SIZE; i++) {
    if (array[i] == 0)
        continue;
    for (j = 0; j < N_TIMES; j++) {
        d += array[i];
    }
}
That's untested, because I doubt it would be acceptable, but the pattern, skipping to the next iteration of the loop in a common case to avoid subsequent unnecessary instructions, is a common optimization practice.
With optimization disabled in the compiler, using a pointer may be faster than an index.
IE, you could loop with:
double * arrayend = array + (ARRAY_SIZE - 1);
double * valuep;
for(valuep = array; valuep <= arrayend; valuep++) {
//inner stuff
}
!= may be faster than <, though don't use equality comparisons on non-integer types.
Using unsigned numbers MAY be faster, though probably not in your case. See: Signed vs Unsigned operations in C
Integers are probably faster than doubles, but may not be big enough for the actual math.
Edit: one more. If you know the cache sizes (and cache line size) of the system, you can potentially optimize for that.
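For example (a Linux/glibc-specific sketch, not part of the original assignment), you can query the cache geometry at run time:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long l1_size = sysconf(_SC_LEVEL1_DCACHE_SIZE);       /* L1 data cache size in bytes */
    long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);   /* L1 data cache line size in bytes */
    printf("L1d: %ld bytes, line size %ld bytes\n", l1_size, l1_line);
    return 0;
}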


Related

Tiled Matrix Multiplication using AVX

I have coded the following C function for multiplying two NxN matrices using tiling/blocking and AVX vectors to speed up the calculation. Right now though I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?
Also, is there a better memory access pattern for matrix B? Maybe transposing it first or even changing the k and j loop? Because right now, I'm traversing it column-wise which is probably not very efficient in regards to spatial locality and cache lines.
void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
{
    int i, j, k, i0, j0, k0;
    // double sum;
    __m256d sum;
    for (i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
        for (k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
            for (j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
                for (i = i0; i < MIN(i0+BLOCKSIZE, SIZE_M); i++) {
                    for (j = j0; j < MIN(j0+BLOCKSIZE, SIZE_K); j++) {
                        // sum = C[i][j];
                        sum = _mm256_load_pd(&C[i][j]);
                        for (k = k0; k < MIN(k0+BLOCKSIZE, SIZE_N); k++) {
                            // sum += A[i][k] * B[k][j];
                            sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
                        }
                        // C[i][j] = sum;
                        _mm256_store_pd(&C[i][j], sum);
                    }
                }
            }
        }
    }
}
_mm256_load_pd is an alignment-required load but you're only stepping by k++, not k+=4 in the inner-most loop that loads a 32-byte vector of 4 doubles. So it faults because 3 of every 4 loads are misaligned.
You don't want to be doing overlapping loads; your real bug is the indexing. If your input pointers are 32-byte aligned, you should be able to keep using _mm256_load_pd instead of _mm256_loadu_pd. So using _mm256_load_pd successfully caught your bug instead of working but giving numerically wrong results.
Your strategy for vectorizing four row*column dot products (to produce a C[i][j+0..3] vector) should load 4 contiguous doubles from 4 different columns (B[k][j+0..3] via a vector load from B[k][j]), and broadcast 1 double from A[i][k]. Remember you want 4 dot products in parallel.
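A sketch of that strategy, leaving out the cache blocking for clarity (my code, not the poster's; it assumes SIZE_K is a multiple of 4 and that the arrays are 32-byte aligned so the aligned loads are legal):
for (int i = 0; i < SIZE_M; i++) {
    for (int j = 0; j < SIZE_K; j += 4) {               // produce C[i][j+0..3] per iteration
        __m256d sum = _mm256_load_pd(&C[i][j]);
        for (int k = 0; k < SIZE_N; k++) {
            __m256d b = _mm256_load_pd(&B[k][j]);       // 4 contiguous doubles from row k of B
            __m256d a = _mm256_broadcast_sd(&A[i][k]);  // one element of A, broadcast to all 4 lanes
            sum = _mm256_add_pd(sum, _mm256_mul_pd(a, b));
        }
        _mm256_store_pd(&C[i][j], sum);                 // 4 dot products finished in parallel
    }
}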
Another strategy might involve a horizontal sum at the end down to a scalar C[i][j] += horizontal_add(__m256d), but I think that would require transposing one input first so both row and column vectors are in contiguous memory for one dot product. But then you need shuffles for a horizontal sum at the end of each inner loop.
You probably also want to use at least 2 sum variables so you can read a whole cache line at once, and hide FMA latency in the inner loop and hopefully bottleneck on throughput. Or better do 4 or 8 vectors in parallel. So you produce C[i][j+0..15] as sum0, sum1, sum2, sum3. (Or use an array of __m256d; compilers will typically fully unroll a loop of 8 and optimize the array into registers.)
I think you only need 5 nested loops, to block over rows and columns. Although apparently 6 nested loops are a valid option: see loop tiling/blocking for large dense matrix multiplication which has a 5-nested loop in the question but a 6-nested loop in an answer. (Just scalar, though, not vectorized).
There might be other bugs besides the row*column dot product strategy here, I'm not sure.
If you're using AVX, you might want to use FMA as well, unless you need to run on Sandy Bridge/Ivy Bridge or AMD Bulldozer. (Piledriver and later have FMA3.)
Other matmul strategies include adding into the destination inside the inner loop so you're loading C and A inside the inner loop, with a load from B hoisted. (Or B and A swapped, I forget.) What Every Programmer Should Know About Memory? has a vectorized cache-blocked example that works this way in an appendix, for SSE2 __m128d vectors. https://www.akkadia.org/drepper/cpumemory.pdf

Matrix multiplication in 2 different ways (comparing time)

I've got an assignment: compare two matrix multiplications, one done the default way and one where the second matrix is transposed first, and point out which method is faster. I've written the code below, but time and time2 are nearly equal to each other. In one run the first method is faster, and in another run, with the same matrix size, the second method is faster. Is something done wrong? Should I change something in my code?
clock_t start = clock();
int sum;
for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
        sum = 0;
        for (int k=0; k<size; ++k) {
            sum = sum + (m1[i][k] * m2[k][j]);
        }
        score[i][j] = sum;
    }
}
clock_t end = clock();
double time = (end-start)/(double)CLOCKS_PER_SEC;

for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
        int temp = m2[i][j];
        m2[i][j] = m2[j][i];
        m2[j][i] = temp;
    }
}

clock_t start2 = clock();
int sum2;
for (int i=0; i<size; ++i) {
    for (int j=0; j<size; ++j) {
        sum2 = 0;
        for (int k=0; k<size; ++k) {
            sum2 = sum2 + (m1[k][i] * m2[k][j]);
        }
        score[i][j] = sum2;
    }
}
clock_t end2 = clock();
double time2 = (end2-start2)/(double)CLOCKS_PER_SEC;
You have multiple severe issues with your code and/or your understanding. Let me try to explain.
Matrix multiplication is bottlenecked by the rate at which the processor can load and store the values to memory. Most current architectures use cache to help with this. Data is moved from memory to cache and from cache to memory in blocks. To maximize the benefit of caching, you want to make sure you will use all the data in that block. To do that, you make sure you access the data sequentially in memory.
In C, multi-dimensional arrays are specified in row-major order. It means that the rightmost index is consecutive in memory; i.e. that a[i][k] and a[i][k+1] are consecutive in memory.
Depending on the architecture, the time it takes for the processor to wait (and do nothing) for the data to be moved from RAM to cache (and vice versa), may or may not be included in the CPU time (that e.g. clock() measures, albeit at a very poor resolution). For this kind of measurement ("microbenchmark"), it is much better to measure and report both CPU and real (or wall clock) time used; especially so if the microbenchmark is run on different machines, to get a better idea of the practical impact of the change.
There will be a lot of variation, so typically, you measure the time taken by a few hundred repeats (each repeat possibly making more than one operation; enough to be easily measured), storing the duration of each, and report their median. Why median, and not minimum, maximum, average? Because there will always be occasional glitches (unreasonable measurement due to an external event, or something), which typically yield a much higher value than normal; this makes the maximum uninteresting, and skews the average (mean) unless removed. The minimum is typically an over-optimistic case, where everything just happened to go perfectly; that rarely occurs in practice, so is only a curiosity, not of practical interest. The median time, on the other hand, gives you a practical measurement: you can expect 50% of all runs of your test case to take no more than the median time measured.
On POSIXy systems (Linux, Mac, BSDs), you should use clock_gettime() to measure the time. The struct timespec format has nanosecond precision (1 second = 1,000,000,000 nanoseconds), but resolution may be smaller (i.e., the clocks change by more than 1 nanosecond, whenever they change). I personally use
#define _POSIX_C_SOURCE 200809L
#include <time.h>

static struct timespec cpu_start, wall_start;
double cpu_seconds, wall_seconds;

void timing_start(void)
{
    clock_gettime(CLOCK_REALTIME, &wall_start);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_start);
}

void timing_stop(void)
{
    struct timespec cpu_end, wall_end;
    clock_gettime(CLOCK_REALTIME, &wall_end);
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &cpu_end);
    wall_seconds = (double)(wall_end.tv_sec - wall_start.tv_sec)
                 + (double)(wall_end.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    cpu_seconds = (double)(cpu_end.tv_sec - cpu_start.tv_sec)
                + (double)(cpu_end.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
}
You call timing_start() before the operation, and timing_stop() after the operation; then, cpu_seconds contains the amount of CPU time taken and wall_seconds the real wall clock time taken (both in seconds, use e.g. %.9f to print all meaningful decimals).
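A typical call site would then look like this (multiply_matrices() is just a placeholder name for whatever code is being measured):
timing_start();
multiply_matrices();
timing_stop();
printf("CPU time: %.9f s, wall clock: %.9f s\n", cpu_seconds, wall_seconds);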
The above won't work on Windows, because Microsoft does not want your C code to be portable to other systems. It prefers to develop its own "standard" instead. (Those C11 "safe" _s() I/O function variants are a stupid sham, compared to e.g. POSIX getline(), or the state of wide character support on all systems except Windows.)
Matrix multiplication is
c[r][c] = a[r][0] * b[0][c]
+ a[r][1] * b[1][c]
: :
+ a[r][L] * b[L][c]
where a has L+1 columns, and b has L+1 rows.
In order to make the summation loop use consecutive elements, we need to transpose b. If B[c][r] = b[r][c], then
c[r][c] = a[r][0] * B[c][0]
+ a[r][1] * B[c][1]
: :
+ a[r][L] * B[c][L]
Note that it suffices that a and B are consecutive in memory, but separate (possibly "far" away from each other), for the processor to utilize cache efficiently in such cases.
OP uses a simple loop, similar to the following pseudocode, to transpose b:
For r in rows:
For c in columns:
temporary = b[r][c]
b[r][c] = b[c][r]
b[c][r] = temporary
End For
End For
The problem above is that each element participates in a swap twice. For example, if b has 10 rows and columns, r = 3, c = 5 swaps b[3][5] and b[5][3], but then later, r = 5, c = 3 swaps b[5][3] and b[3][5] again! Essentially, the double loop ends up restoring the matrix to the original order; it does not do a transpose.
Consider the following entries and the actual transpose:
b[0][0] b[0][1] b[0][2] b[0][0] b[1][0] b[2][0]
b[1][0] b[1][1] b[1][2] ⇔ b[0][1] b[1][1] b[2][1]
b[2][0] b[2][1] b[2][2] b[0][2] b[1][2] b[2][2]
The diagonal entries are not swapped. You only need to do the swap in the upper triangular portion (where c > r) or in the lower triangular portion (where r > c), to swap all entries, because each swap swaps one entry from the upper triangular to the lower triangular, and vice versa.
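A corrected in-place transpose following that observation, for the question's square size-by-size matrix m2, visits only the upper triangle so each pair is swapped exactly once:
for (int r = 0; r < size; ++r) {
    for (int c = r + 1; c < size; ++c) {   /* only c > r: upper triangular portion */
        int temporary = m2[r][c];
        m2[r][c] = m2[c][r];
        m2[c][r] = temporary;
    }
}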
So, to recap:
Is something done wrong?
Yes. Your transpose does nothing. You haven't understood the reason why one would want to transpose the second matrix. Your time measurement relies on a low-precision CPU time, which may not reflect the time taken by moving data between RAM and CPU cache. In the second test case, with m2 "transposed" (except it isn't, because you swap each element pair twice, returning them to the way they were), your innermost loop is over the leftmost array index, which means it calculates the wrong result. (Moreover, because consecutive iterations of the innermost loop access items far from each other in memory, it is anti-optimized: it uses the access pattern that is worst in terms of speed.)
All the above may sound harsh, but it isn't intended to be, at all. I do not know you, and I am not trying to evaluate you; I am only pointing out the errors in this particular answer, in your current understanding, and only in the hopes that it helps you, and anyone else encountering this question in similar circumstances, to learn.

How to use AVX/SIMD with nested loops and += format?

I am writing a page rank program. I am writing a method for updating the rankings. I have successful got it working with nested for loops and also a threaded version. However I would like to instead use SIMD/AVX.
This is the code I would like to change into a SIMD/AVX implementation.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing

for (size_t i = 0; i < npages; i++) {
    temp[i] = 0.0;
    for (size_t j = 0; j < npages; j++) {
        temp[i] += P[j] * matrix_cap[IDX(i,j)];
    }
}
For this code P[] is of size npages and matrix_cap[] is of size npages * npages. P[] is the ranks of the pages and temp[] is used to store the next iterations page ranks so as to be able to check convergence.
I don't know how to interpret += with AVX and how I would get my data which involves two arrays/vectors of size npages and one matrix of size npages * npages (in row major order) into a format of which could be used with SIMD/AVX operations.
As far as AVX this is what I have so far though it's very very incorrect and was just a stab at what I would roughly like to do.
ssize_t g_mod = npages - (npages % 4);
double* res = malloc(sizeof(double) * npages);
double sum = 0.0;
for (size_t i = 0; i < npages; i++) {
for (size_t j = 0; j < mod; j += 4) {
__m256d p = _mm256_loadu_pd(P + j);
__m256d m = _mm256_loadu_pd(matrix_hat + i + j);
__m256d pm = _mm256_mul_pd(p, m);
_mm256_storeu_pd(&res + j, pm);
for (size_t k = 0; k < 4; k++) {
sum += res[j + k];
}
}
for (size_t i = mod; i < npages; i++) {
for (size_t j = 0; j < npages; j++) {
sum += P[j] * matrix_cap[IDX(i,j)];
}
}
temp[i] = sum;
sum = 0.0;
}
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimize it, given that it will be called a lot?
Consider using an OpenMP 4.x #pragma omp simd reduction for the innermost loop. Bear in mind that omp reductions are not applicable to C++ arrays, so you have to use a temporary reduction variable as shown below.
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing

for (size_t i = 0; i < npages; i++) {
    my_type tmp_reduction = 0.0; // was: // temp[i] = 0.0;
    #pragma omp simd reduction (+:tmp_reduction)
    for (size_t j = 0; j < npages; j++) {
        tmp_reduction += P[j] * matrix_cap[IDX(i,j)];
    }
    temp[i] = tmp_reduction;
}
For x86 platforms, OpenMP 4.x is currently supported by fresh GCC (4.9+) and Intel compilers. Some LLVM and PGI compilers may also support it.
P.S. Auto-vectorization ("auto" meaning vectorization by the compiler without any pragmas, i.e. without explicit guidance from developers) may sometimes work for some compiler variants (although it's very unlikely here due to the array element being used as the reduction variable). However, strictly speaking it is incorrect to auto-vectorize this code. You have to use an explicit SIMD pragma to "resolve" the reduction dependency and (as a good side effect) disambiguate the pointers (in case the arrays are accessed via pointers).
First, EOF is right, you should see how well gcc/clang/icc do at auto-vectorizing your scalar code. I can't check for you, because you only posted code-fragments, not anything I can throw on http://gcc.godbolt.org/.
You definitely don't need to malloc anything. Notice that your intrinsics version only ever uses 32B at a time of res[], and always overwrites whatever was there before. So you might as well use a single 32B array. Or better, use a better method to get a horizontal sum of your vector.
(see the bottom for a suggestion on a different data arrangement for the matrix)
Calculating each temp[i] uses every P[j], so there is actually something to be gained from being smarter about vectorizing. For every load from P[j], use that vector with 4 different loads from matrix_cap[] for that j, but 4 different i values. You'll accumulate 4 different vectors, and have to hsum each of them down to a temp[i] value at the end.
So your inner loop will have 5 read streams (P[] and 4 different rows of matrix_cap). It will do 4 horizontal sums, and 4 scalar stores at the end, with the final result for 4 consecutive i values. (Or maybe do two shuffles and two 16B stores). (Or maybe transpose-and-sum together, which is actually a good use-case for the shuffling power of the expensive _mm256_hadd_pd (vhaddpd) instruction, but be careful of its in-lane operation)
It's probably even better to accumulate 8 to 12 temp[i] values in parallel, so every load from P[j] is reused 8 to 12 times. (check the compiler output to make sure you aren't running out of vector regs and spilling __m256d vectors to memory, though.) This will leave more work for the cleanup loop.
FMA throughput and latency are such that you need 10 vector accumulators to keep 10 FMAs in flight to saturate the FMA unit on Haswell. Skylake reduced the latency to 4c, so you only need 8 vector accumulators to saturate it on SKL. (See the x86 tag wiki). Even if you're bottlenecked on memory, not execution-port throughput, you will want multiple accumulators, but they could all be for the same temp[i] (so you'd vertically sum them down to one vector, then hsum that).
However, accumulating results for multiple temp[i] at once has the large advantage of reusing P[j] multiple times after loading it. You also save the vertical adds at the end. Multiple read streams may actually help hide the latency of a cache miss in any one of the streams. (HW prefetchers in Intel CPUs can track one forward and one reverse stream per 4k page, IIRC). You might strike a balance, and use two or three vector accumulators for each of 4 temp[i] results in parallel, if you find that multiple read streams are a problem, but that would mean you'd have to load the same P[j] more times total.
So you should do something like
#define IDX(a, b) ((a * npages) + b) // 2D matrix indexing

for (size_t i = 0; i < (npages & (~7ULL)); i += 8) {
    __m256d s0 = _mm256_setzero_pd(),
            s1 = _mm256_setzero_pd(),
            s2 = _mm256_setzero_pd(),
            ...
            s7 = _mm256_setzero_pd();   // 8 accumulators for 8 i values
    for (size_t j = 0; j < (npages & ~(3ULL)); j += 4) {
        __m256d Pj = _mm256_loadu_pd(P+j);   // reused 8 times after loading
        //temp[i] += P[j] * matrix_cap[IDX(i,j)];
        s0 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+0,j)]), s0);
        s1 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+1,j)]), s1);
        // ...
        s7 = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i+7,j)]), s7);
    }
    // or do this block with a hsum+transpose and do vector stores,
    // taking advantage of the power of vhaddpd to do 4 useful hsums with each instruction.
    temp[i+0] = hsum_pd256(s0);   // see the horizontal-sum link earlier for how to write this function
    temp[i+1] = hsum_pd256(s1);
    //...
    temp[i+7] = hsum_pd256(s7);
    // if npages isn't a multiple of 4, add the last couple scalar elements to the results of the hsum_pd256()s.
}
// TODO: cleanup for the last up-to-7 odd elements.
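hsum_pd256 above is not an intrinsic, and the horizontal-sum link it refers to isn't included in this excerpt; one common way to write it (a sketch, not necessarily the fastest variant) is:
static inline double hsum_pd256(__m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);       // lower 128 bits
    __m128d hi = _mm256_extractf128_pd(v, 1);     // upper 128 bits
    lo = _mm_add_pd(lo, hi);                      // two partial sums
    __m128d high64 = _mm_unpackhi_pd(lo, lo);     // move the high element down
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64)); // final scalar sum
}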
You could probably write __m256d sums[8] and loop over your vector accumulators, but you'd have to check that the compiler fully unrolls it and still actually keeps everything live in registers.
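A sketch of that variant of the inner part (same assumptions as the code above; whether it stays in registers depends on the compiler fully unrolling the fixed-count loops):
__m256d sums[8];
for (int s = 0; s < 8; s++)
    sums[s] = _mm256_setzero_pd();
for (size_t j = 0; j < (npages & ~(3ULL)); j += 4) {
    __m256d Pj = _mm256_loadu_pd(P + j);          // still reused 8 times per load
    for (int s = 0; s < 8; s++)
        sums[s] = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(&matrix_cap[IDX(i + s, j)]), sums[s]);
}
for (int s = 0; s < 8; s++)
    temp[i + s] = hsum_pd256(sums[s]);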
How can I format my data so I can use AVX/SIMD operations (add, mul) on it to optimize it, given that it will be called a lot?
I missed this part of the question earlier. First of all, obviously float will give you 2x the number of elements per vector (and per unit of memory bandwidth). The factor-of-2 smaller memory / cache footprint might give more speedup than that if the cache hit rate increases.
Ideally, the matrix would be "striped" to match the vector width. Every load from the matrix would get a vector of matrix_cap[IDX(i,j)] for 4 adjacent i values, but the next 32B would be the next j value for the same 4 i values. This means that each vector accumulator is accumulating the sum for a different i in each element, so no need for horizontal sums at the end.
P[j] stays linear, but you broadcast-load each element of it, for use with 8 vectors of 4 i values each (or 8 vectors of 8 i values each for float). So you increase your reuse factor for P[j] loads by a factor of the vector width. Broadcast-loads are near-free on Haswell and later (they still only take a load-port uop), and plenty cheap for this on SnB/IvB where they also take a shuffle-port uop.
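A sketch of what the inner loops could look like with such a striped layout (matrix_striped is a hypothetical rearranged copy of matrix_cap, and npages is assumed to be a multiple of 4 here):
// layout: matrix_striped[(i/4) * npages*4 + j*4 + (i%4)] == matrix_cap[IDX(i,j)]
for (size_t i = 0; i < npages; i += 4) {
    const double *stripe = &matrix_striped[(i / 4) * npages * 4];
    __m256d acc = _mm256_setzero_pd();            // lane m accumulates the sum for row i+m
    for (size_t j = 0; j < npages; j++) {
        __m256d Pj = _mm256_broadcast_sd(&P[j]);  // splat one P[j] to all 4 lanes
        acc = _mm256_fmadd_pd(Pj, _mm256_loadu_pd(stripe + j * 4), acc);
    }
    _mm256_storeu_pd(&temp[i], acc);              // no horizontal sum needed
}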

Is vectorization profitable in this case?

I broke a kernel down into several loops, in order to vectorize each one of them afterwards. One of these loops looks like:
int *array1; //Its size is "size+1";
int *array2; //Its size is "size+1";

//All positions of array1 and array2 are set to 0 here;

int *sArray1 = array1+1; //Shift one position so I start writing on pos 1
int *sArray2 = array2+1; //Shift one position so I start writing on pos 1

int bb = 0;
for (int i=0; i<size; i++){
    if (A[i] + bb > B[i]) {
        bb = 1;
        sArray1[i] = S;
        sArray2[i] = 1;
    }
    else
        bb = 0;
}
Please note the loop-carried dependency, in bb - each comparison depends upon bb's value, which is modified on the previous iteration.
What I thought about:
I can be absolutely certain of some cases. For example, when A[i] is already greater than B[i], I do not need to know what value bb carries from the previous iteration;
When A[i] equals B[i], I need to know what value bb carries from the previous iteration. However, I also need to account for the case where this happens in two consecutive positions. When I started to shape up these cases, it seemed that they become overly complicated and vectorization doesn't pay off.
Essentially, I'd like to know if this can be vectorized in an effective manner or if it is simply better to run this without any vectorization whatsoever.
You might not want to iterate over single elements, but instead loop over chunks (a chunk being a run of consecutive elements that all yield the same bb).
The search for chunk boundaries could be vectorized (by hand, probably using compiler-specific SIMD intrinsics).
And the action to be taken for a chunk with bb=1 could be vectorized, too.
The loop transformation is as follows:
size_t i_chunk_start = 0, i_chunk_end;
int bb_chunk = A[0] > B[0] ? 1 : 0;
while (i_chunk_start < isize) {
    if (bb_chunk) {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < isize; ++i_chunk_end) {
            if (A[i_chunk_end] < B[i_chunk_end]) {
                break;
            }
        }
        /* process current chunk */
        for (size_t i = i_chunk_start; i < i_chunk_end; ++i) {
            sArray1[i] = S;
            sArray2[i] = 1;
        }
        bb_chunk = 0;
    } else {
        /* find end of current chunk */
        for (i_chunk_end = i_chunk_start + 1; i_chunk_end < isize; ++i_chunk_end) {
            if (A[i_chunk_end] > B[i_chunk_end]) {
                break;
            }
        }
        bb_chunk = 1;
    }
    /* prepare for next chunk */
    i_chunk_start = i_chunk_end;
}
Now, each of the inner loops (all for loops) could potentially get vectorized.
Whether or not vectorizing in this manner beats the non-vectorized version depends on whether the chunks are sufficiently long on average. You will only find out by benchmarking.
The effect of your loop body depends on two conditions:
A[i] > B[i]
A[i] + 1 > B[i]
Their calculation can be vectorized easily. Assuming int has 32 bits, and vectorized instructions work on 4 int values at a time, there are 8 bits per vectorized iteration (4 bits for each condition).
You can harvest those bits from an SSE register with _mm_movemask_epi8. It's a bit inconvenient that it works on bytes and not on ints, but you can take care of that with a suitable shuffle.
Afterwards, use the 8 bits as an address into a LUT (of 256 entries) which stores 4-bit masks. These masks can be used to store the elements into the destination conditionally, using _mm_maskmoveu_si128.
I am not sure such a complicated program is worthwhile: it involves a lot of bit-fiddling for just a 4x improvement in speed. Maybe it's better to build the masks by examining the decision bits individually. But vectorizing your comparisons and stores seems worthwhile in any case.
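A sketch of computing those condition bits for one group of four elements (my addition; it uses _mm_movemask_ps on the integer comparison results instead of _mm_movemask_epi8 plus a shuffle, which yields the 4 bits per condition directly, and it ignores potential overflow of A[i]+1):
// needs <emmintrin.h> (SSE2)
__m128i a   = _mm_loadu_si128((const __m128i *)&A[i]);
__m128i b   = _mm_loadu_si128((const __m128i *)&B[i]);
__m128i c0  = _mm_cmpgt_epi32(a, b);                                    /* A[i]   > B[i] */
__m128i c1  = _mm_cmpgt_epi32(_mm_add_epi32(a, _mm_set1_epi32(1)), b);  /* A[i]+1 > B[i] */
int bits0   = _mm_movemask_ps(_mm_castsi128_ps(c0));   /* 4 bits, one per element */
int bits1   = _mm_movemask_ps(_mm_castsi128_ps(c1));   /* 4 bits, one per element */
int lut_idx = (bits1 << 4) | bits0;                    /* 8-bit index into the 256-entry LUT */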

Manually optimize a nested loop

I'm working on a homework assignment where I must manually optimize a nested loop (my program will be compiled with optimizations disabled). The goal of the assignment is to run the entire program in less than 6 seconds (extra credit for less than 4.5 seconds).
I'm only allowed to change a small block of code, and the starting point is such:
for (j=0; j < ARRAY_SIZE; j++) {
sum += array[j];
}
Where ARRAY_SIZE is 9973. This loop is contained within another loop that is run 200,000 times. This particular version runs in 16 seconds.
What I've done so far is change the implementation to unroll the loop and use pointers as my iterator:
(These declarations are not looped over 200,000 times)
register int unroll_length = 16;
register int *unroll_end = array + (ARRAY_SIZE - (ARRAY_SIZE % unroll_length));
register int *end = array + (ARRAY_SIZE - 1);
register int *curr_end;

curr_end = end;

while (unroll_end != curr_end) {
    sum += *curr_end;
    curr_end--;
}

do {
    sum += *curr_end + *(curr_end-1) + *(curr_end-2) + *(curr_end-3) +
           *(curr_end-4) + *(curr_end-5) + *(curr_end-6) + *(curr_end-7) +
           *(curr_end-8) + *(curr_end-9) + *(curr_end-10) + *(curr_end-11) +
           *(curr_end-12) + *(curr_end-13) + *(curr_end-14) + *(curr_end-15);
} while ((curr_end -= unroll_length) != array);

sum += *curr_end;
Using these techniques, I was able to get the execution time down to 5.5 seconds, which will give me full credit. However, I sure do want to earn the extra credit, and I'm also curious what additional optimizations I might be overlooking.
Edit #1 (Adding outer loop)
srand(time(NULL));

for (j = 0; j < ARRAY_SIZE; j++) {
    x = rand() / (int)(((unsigned)RAND_MAX + 1) / 14);
    array[j] = x;
    checksum += x;
}

for (i = 0; i < N_TIMES; i++) {

    // inner loop goes here

    if (sum != checksum)
        printf("Checksum error!\n");
    sum = 0;
}
You could try to store your variables in CPU registers with:
register int *unroll_limit = array + (ARRAY_SIZE - (ARRAY_SIZE % 10));
register int *end = array + ARRAY_SIZE;
register int *curr;
and try different amounts of manual unrolling to see where you maximize cache usage.
I'm going to assume you're on x86, if you're not most of this will still apply but the details differ.
Use SIMD/SSE. This will get you a 4x speed increase without much effort. It needs 16-byte aligned data, which you can get with _aligned_malloc or regular malloc plus manual alignment. Besides that, all you'll need in this case is _mm_add_epi32 to do four additions at the same time. (Different architectures have different SIMD units, so check yours.) A sketch follows this list of suggestions.
Use multi-threading/ multiple cores in this case it'd be easiest to have each thread sum half the array to a temporary variable and sum those two results when done. This will scale linearly across the number of cores available.
Prefetch to L1 cache; this only works when you've got a huge array and are sure to be able to stress the CPU for at least ~200 cycles (eg. a roundtrip to main RAM).
Completely go out of your way to optimize the hell out of it and use a GPU based approach. This will require you to set up a CUDA or OpenCL environment and upload the array to the GPU. This is about ~400 LoC excluding the compute kernel. But might not be worth it if you have a small dataset (eg. too much overhead in setting up/tearing down) or if you have a huge changing dataset (eg. too much time spend in streaming to the GPU).
Align to page boundaries to prevent page-faults (expensive) on windows these are usually 4K in size.
Manually unroll the loop while taking into account dual-issued instructions and instruction latencies. This information is available from your CPU manufacturer (Intel provides it too). But on x86 this isn't all that useful because of its CPUs' out-of-order execution.
Depending on your platform actually getting the data to the CPU for processing is the slowest part (this is mainly true for recent consoles & PS, I've never developed for small embedded devices) so you'll want to optimize for that. Tricks like iterating backwards are nice on a 6502 when cycles were the bottleneck but these days you'll want to access RAM linearly.
If you do happen to be on a machine with fast RAM (eg. NOT PC/Consoles), converting from the plain array to a more fancy data-structure (eg. one that does more pointer chasing) might totally be worth it.
All in all, I guess that 1 & 2 are easiest and most feasible and will gain you more than enough performance (eg. 8x on a Core 2 Duo). However, it all comes down to knowing your hardware and programming PIC will require completely different optimizations (eg. instruction level manual pipelining) than a general PC will.
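A sketch of suggestion 1 with SSE2 integer adds (my addition; it uses unaligned loads so the alignment requirement goes away, and it needs <emmintrin.h>):
__m128i vsum = _mm_setzero_si128();
int j;
for (j = 0; j + 4 <= ARRAY_SIZE; j += 4)
    vsum = _mm_add_epi32(vsum, _mm_loadu_si128((const __m128i *)&array[j]));   /* 4 additions at once */
int lanes[4];
_mm_storeu_si128((__m128i *)lanes, vsum);
sum += lanes[0] + lanes[1] + lanes[2] + lanes[3];   /* combine the 4 partial sums */
for (; j < ARRAY_SIZE; j++)                         /* scalar tail for the leftover elements */
    sum += array[j];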
Try to align the array on a page boundary ( i.e. 4K )
Try to compute with a wider data type, i.e. 64-bit instead of 32-bit integers. This way you can add two numbers at once; as the final step, add up the two halves (see the sketch after this list of suggestions).
Convert part of the array or the computation to floating point, so you can use FPU and CPU in parallel
I don't expect the following suggestions to be allowed but I mention them anyway
Multithreading
Specialized CPU-Instructions, i.e. SSE
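A sketch of the wider-data-type idea (my addition; it relies on the values being small non-negative ints, as they are here, so the two 32-bit halves can never carry into each other, and the cast technically breaks strict aliasing; needs <stdint.h>):
const uint64_t *wide = (const uint64_t *)array;     /* view pairs of ints as one 64-bit value */
uint64_t pair_sum = 0;
int k;
for (k = 0; k < ARRAY_SIZE / 2; k++)
    pair_sum += wide[k];                            /* adds two 32-bit elements per iteration */
sum += (int)(pair_sum & 0xFFFFFFFFu) + (int)(pair_sum >> 32);   /* combine the two halves */
if (ARRAY_SIZE % 2)                                 /* ARRAY_SIZE is odd (9973), so add the last element */
    sum += array[ARRAY_SIZE - 1];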
If the array values don't change, you could memoize the sum (i.e. calculate it on first run, and use the calculated sum on subsequent runs).
Some nice optimization tricks:
Make your loop count backwards from ARRAY_SIZE to 0; that way you can drop the comparison against ARRAY_SIZE from your code, and fewer comparisons speed up the program.
Furthermore, x86 CPUs nowadays are optimized for short loops, which they can "preload" to run faster than normal.
Try to use registers wherever possible
Use pointers instead of array indices
So if you use arrays, try something like:
register int idx = ARRAY_SIZE;   /* assumes ARRAY_SIZE >= 10 */
register int sum = 0;

/* Peel elements off the top until the remaining count is a multiple of 10. */
while (idx % 10 != 0)
    sum += array[--idx];

/* Sum blocks of 10, counting idx down until it reaches exactly 0. */
do {
    sum += array[idx - 1] + array[idx - 2] + array[idx - 3] + array[idx - 4] + array[idx - 5] + array[idx - 6] + array[idx - 7] + array[idx - 8] + array[idx - 9] + array[idx - 10];
} while (idx -= 10);

// Now we don't need a separate comparison: `idx -= 10` sets the ZERO flag in the
// FLAGS register when idx reaches 0, and we can conditionally jump on that. With an
// explicit comparison you do VALUE - VALUE and then check whether the ZERO flag
// (or the NEGATIVE flag, or whatever you are testing) is set anyway.
