Unrolling Loops (C) - c

I understand the concept of unrolling loops however, can someone explain to me how to unroll a simple loop?
It would be great if you would show me a loop, and then a unrolled version of that loop with explanations of what is happening.

I think it's important to clarify when loop unrolling is most effective: with dependency chains. A dependency chain is a series of operations where each calculation depends on the previous calculation. For example, the following loop has a dependency chain.
for(i=0; i<n; i++) sum += a[i];
Most modern processors can do multiple out-of-order operations per cycle. This increases the instruction throughput. However, out-of-order operations can't do this in a dependency chain. In the loop above each calculation is bound by the latency of the addition operation.
In the loop above we can unroll it into two dependency chains like this
sum1 = 0, sum2 = 0;
for(i=0; i<n/2; i+=2) sum1 += a[2*i], sum2 += a[2*i+1];
for(i=(n/2)*2; i<n; i++) sum += a[i]; // clean up for n odd
sum += sum1 + sum2;
Now an out-of-order processor could operate on either chain independently and depending on the processor simultaneously.
In general you should unroll by an amount equal to the latency of the operation times the number of those operations that can be done per clock cycle. For example with a x86_64 processor it can perform at least one SSE addition per clock cycle and the SSE addition has a latency of 3 so you should unroll three times. With a Haswell processor it can do two FMA operations per clock cycle and each FMA operations has a latency of 5 so you would need to unroll 10 times to get the maximum throughput.
As far as compilers go GCC does not unroll dependency chains (even with -funroll-loops). You have to unroll yourself with GCC. With Clang it unrolls four times which is generally pretty good (in some cases on Haswell and Broadwell you would need to unroll 10 times and with Skylake 8 times).
Another reason to unroll is when the number of operations in a loop exceeds the number of instructions which can be push through per clock cycle. For example in the following loop
for(i=0; i<n; i++) b[i] += 3.14159*a[i];
there is no dependency chain so there is no problem with out-of-order execution. But let's consider an instruction set which needs the following operations per iteration.
2 SIMD load
1 SIMD store
1 SIMD multiply
1 SIMD addition
1 scalar addition for the loop counter
1 conditional jump
Let's also assume the the processor can push through five of these instructions per cycle. In this case there are seven instructions per iteration but only five can be done per cycle. Loop unrolling can then be used to amortize the cost of the scalar addition to the counter i and the conditional jump. For example if you fully unrolled the loop these instruction would not be necessary.
For amortizing the cost of the loop counter and jump -funroll-loops works fine with GCC . It unrolls eight times which means the counter addition and jump has to be done once every eight iteration instead of every iteration.

The process of unrolling loops utilizes an essential concept in computer science: the space-time tradeoff, where increasing the space used can often lead to decreasing the time of an algorithm.
Let's say we have a simple loop,
const int n = 1000;
for (int i = 0; i < n; ++i) {
foo();
}
This is compiled to assembly looking something like this:
mov eax, 0
loop:
call foo
inc eax
cmp eax, 1000
jne loop
So the space-time trade-off is 5 lines of assembly for ~(4 * 1000) = ~4000 instructions executed.
So, let's try and unroll the loop a bit.
for (int i = 0; i < n; i += 10) {
foo();
foo();
foo();
foo();
foo();
foo();
foo();
foo();
foo();
foo();
}
And its assembly:
mov eax, 0
loop:
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
call foo
add eax, 10
cmp eax, 1000
jne loop
The space-time trade-off is 14 lines of assembly for ~(14 * 100) = ~1400 instructions executed.
We can do a total unrolling, like this:
foo();
foo();
// ...
// 996 foo()'s
// ...
foo();
foo();
Which compiles in assembly as 1000 call instructions.
This gives a space-time trade-off of 1000 lines of assembly for 1000 instructions.
As you can see, the general trend is that to reduce the amount of instructions executed by the CPU, we must increase the space required.
It is not efficient to totally unroll a loop, as the space required becomes extremely large. Partial unrolling gives huge benefits with greatly diminishing returns the more you unroll the loop.
While it's a good idea to understand loop unrolling, keep in mind that the compiler is smart and will do it for you.

Rolled (regular):
#define N 44
int main() {
int A[N], B[N];
int i;
// fill A with stuff ...
for(i = 0; i < N; i++) {
B[i] = A[i] * (100 % i);
}
// do stuff with B ...
}
Unrolled:
#define N 44
int main() {
int A[N], B[N];
int i;
// fill A with stuff ...
for(i = 0; i < N; i += 4) {
B[i] = A[i] * (100 % i);
B[i+1] = A[i+1] * (100 % i+1);
B[i+2] = A[i+2] * (100 % i+2);
B[i+3] = A[i+3] * (100 % i+3);
}
// do stuff with B ...
}
Unrolling can potentially increase performance at the cost of a larger program size. Performance increases could be due to a reduction in branch penalties, cache misses and execution instructions. Some disadvantages are obvious, like an increase in the amount of code and a decrease in readability, and some are not so obvious.

Related

Make out-of-order CPU run instructions in-order

Consider the loop:
for (int i = 0; i < n; i++) {
sum += a[i];
}
An out-of-order CPU can execute many instructions in advance, it can e.g. have 20 parallel pending loads of a[i] from 20 different iterations of the loop.
But, for me, this is a hindrance. I want that the CPU works like an in-order CPU. I want it to not start a load in the next iteration until it has finished the load in the current iteration.
The reason I want this is very simple: I want to save the memory bandwidth for other processes running on other CPU core. This process is low priority, and I want to limit is as much as possible even though it will get slower.
Two techniques come to mind: fake loop-carried dependencies and memory barriers.
For fake dependencies, something like this can be used:
double* a_current = a;
for (int i = 0; i < n; i++) {
volatile int a_val = *a_current;
sum += a_val;
a_current += 1 + (a_val - a_val);
}
This is horrible code and I wonder if there is something better.
About memory barriers, I don't know almost anything. What could be useful there?

Why cannot my program reach integer addition instruction throughput bound?

I have read chapter 5 of CSAPP 3e. I want to test if the optimization techniques described in the book can work on my computer. I write the following program:
#define SIZE (1024)
int main(int argc, char* argv[]) {
int sum = 0;
int* array = malloc(sizeof(int) * SIZE);
unsigned long long before = __rdtsc();
for (int i = 0; i < SIZE; ++i) {
sum += array[i];
}
unsigned long long after = __rdtsc();
double cpe = (double)(after - before) / SIZE;
printf("CPE is %f\n", cpe);
printf("sum is %d\n", sum);
return 0;
}
and it reports the CPE is around 1.00.
I transform the program using the 4x4 loop unrolling technique and it leads to the following program:
#define SIZE (1024)
int main(int argc, char* argv[]) {
int sum = 0;
int* array = malloc(sizeof(int) * SIZE);
int sum0 = 0;
int sum1 = 0;
int sum2 = 0;
int sum3 = 0;
/* 4x4 unrolling */
unsigned long long before = __rdtsc();
for (int i = 0; i < SIZE; i += 4) {
sum0 += array[i];
sum1 += array[i + 1];
sum2 += array[i + 2];
sum3 += array[i + 3];
}
unsigned long long after = __rdtsc();
sum = sum0 + sum1 + sum2 + sum3;
double cpe = (double)(after - before) / SIZE;
printf("CPE is %f\n", cpe);
printf("sum is %d\n", sum);
return 0;
}
Note that I omit the code to handle the situation when SIZE is not a multiple of 4. This program reports the CPE is around 0.80.
My program runs on an AMD 5950X, and according to AMD's software optimization manual (https://developer.amd.com/resources/developer-guides-manuals/), the integer addition instruction has a latency of 1 cycle and throughput of 4 instructions per cycle. It also has a load-store unit which could execute three independent load operations at the same time. My expectation of the CPE is 0.33, and I do not know why the result is so much higher.
My compiler is gcc 12.2.0. All programs are compiled with flags -Og.
I checked the assembly code of the optimized program, but found nothing helpful:
.L4:
movslq %r9d, %rcx
addl (%r8,%rcx,4), %r11d
addl 4(%r8,%rcx,4), %r10d
addl 8(%r8,%rcx,4), %ebx
addl 12(%r8,%rcx,4), %esi
addl $4, %r9d
.L3:
cmpl $127, %r9d
jle .L4
I assume at least 3 of the 4 addl instructions should execute in parallel. However, the result of the program does not meet my expectation.
cmpl $127, %r9d is not a large iteration count compared to rdtsc overhead and the branch mispredict when you exit the loop, and time for the CPU to ramp up to max frequency.
Also, you want to measure core clock cycles, not TSC reference cycles. Put the loop in a static executable (for minimal startup overhead) and run it with perf stat to get core clocks for the whole process. (As in Can x86's MOV really be "free"? Why can't I reproduce this at all? or some perf experiments I've posted in other answers.)
See Idiomatic way of performance evaluation?
10M to 1000M total iterations is appropriate since that's still under a second and we only want to measure steady-state behaviour, not cold-cache or cold-branch-predictor effect. Or page-faults. Interrupt overhead tends to be under 1% on an idle system. Use perf stat --all-user to only count user-space cycles and instructions.
If you want to do it over an array (instead of just removing the pointer increment from the asm), do many passes over a small (16K) array so they all hit in L1d cache. Use a nested loop, or use an and to wrap an index.
Doing that, yes you should be able to measure the 3/clock throughput of add mem, reg on Zen3 and later, even if you leave in the movslq overhead and crap like that from compiler -Og output.
When you're truly micro-benchmarking to find out stuff about throughput of one form of one instruction, it's usually easier to write asm by hand than to coax a compiler into emitting the loop you want. (As long as you know enough asm to avoid pitfalls, e.g. .balign 64 before the loop just for good measure, to hopefully avoid front-end bottlenecks.)
See also https://uops.info/ for how they measure; for any given test, you can click on the link to see the asm loop body for the experiments they ran, and the raw perf counter outputs for each variation on the test. (Although I have to admit I forget what MPERF and APERF mean for AMD CPUs; the results for Intel CPUs are more obvious.) e.g. https://uops.info/html-tp/ZEN3/ADD_R32_M32-Measurements.html is the Zen3 results, which includes a test of 4 or 8 independent add reg, [r14+const] instructions as the inner loop body.
They also tested with an indexed addressing mode. With "With unroll_count=200 and no inner loop" they got identical results for MPERF / APERF / UOPS for 4 independent adds, with indexed vs. non-indexed addressing modes. (Their loops don't have a pointer increment.)

loop unrolling not giving expected speedup for floating-point dot product

/* Inner product. Accumulate in temporary */
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
long i;
long length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
data_t sum = (data_t) 0;
for (i = 0; i < length; i++) {
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
Write a version of the inner product procedure described in the above problem that
uses 6 × 1a loop unrolling . For x86-64, our measurements of the unrolled version
give a CPE of 1.07 for integer data but still 3.01 for both floating-point data.
My code for 6*1a version of loop unrolling
void inner4(vec_ptr u, vec_ptr v, data_t *dest){
long i;
long length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
long limit = length -5;
data_t sum = (data_t) 0;
for(i=0; i<limit; i+=6){
sum = sum +
((udata[ i ] * vdata[ i ]
+ udata[ i+1 ] * vdata[ i+1 ])
+ (udata[ i+2 ] * vdata[ i+2 ]
+ udata[ i+3 ] * vdata[ i+3 ]))
+ ((udata[ i+4 ] * vdata[ i+4 ])
+ udata[ i+5 ] * vdata[ i+5 ]);
}
for (i = 0; i < length; i++) {
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
Question: Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.
Any idea how to solve the problem?
Your unroll doesn't help with the FP latency bottleneck:
sum + x + y + z without -ffast-math is the same order of operations as sum += x; sum += y; ... so you haven't done anything about the single dependency chain running through all the + operations. Loop overhead (or front-end throughput) is not the bottleneck, it's the 3 cycle latency of addss on Haswell, so this unroll makes basically no difference.
What would work is sum += u[i]*v[i] + u[i+1]*v[i+1] + ... as a way to unroll without multiple accumulators, because then the sum of each group of elements is independent.
It costs slightly more math operations that way, like starting with a mul and ending with an add, but the middle ones can still contract into FMAs if you compile with -march=haswell. See comments on AVX performance slower for bitwise xor op and popcount for an example of GCC turning a naive unroll like sum += u0*v0; sum += u1*v1 into sum += u0*v0 + u1*v1;. In that case the problem was slightly different: sum of squared differences like sum += (u0-v0)**2 + (u1-v1)**2;, but it boils down to the same latency problem of ultimately doing some multiplies and adds.
The other way to solve the problem is with multiple accumulators, allowing all the operations to be FMAs. But Haswell has 5-cycle latency FMA, and 3-cycle latency addss, so doing the sum += ... addition on its own, not as part of an FMA, actually helps with the latency bottleneck on Haswell (unlike on Skylake add/sub/mul are all 4 cycle latency). The following all show unrolling with multiple accumulators, instead of with adding groups together like the first towards pairwise summation like you're doing:
Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
When, if ever, is loop unrolling still useful?
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
FP math instruction throughput isn't the bottleneck for a big dot product on modern CPUs, only latency. Or load throughput if you unroll enough.
Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00.
Each element takes 2 loads, and with only 2 load ports, that's a hard throughput bottleneck. (https://agner.org/optimize/ / https://www.realworldtech.com/haswell-cpu/5/)
I'm assuming you're counting an "element" as an i value, a pair of floats, one each from udata[i] and vdata[i]. The FP FMA throughput bottleneck is also 2/clock on Haswell (whether they're scalar, 128-bit, or 256-bit vectors), but dot product takes 2 loads per FMA. In theory, even Sandybridge or maybe even K8 could achieve 1 element per clock, with separate mul and add instructions, since they both support 2 loads per clock, and have a wide enough pipeline to get load / mulss / addss through the pipeline with some room to spare.

optimized sum of an array of doubles in C [duplicate]

This question already has answers here:
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
Closed 5 years ago.
I've got an assignment where I must take a program and make it more efficient in terms of time.
the original code is:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
help++;
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// ... and this one.
return 0;
}
I almost exclusively modified the second for loop by changing it to
double *j=array;
double *p=array+ARRAY_SIZE;
for(; j<p;j+=10){
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
{
this on its own was able to reduce the time down to the criteria...
it already seems to work but are there any mistakes i'm not seeing?
I posted an improved version of this answer on a duplicate of this: C loop optimization help for final assignment. It was originally just a repost, but then I made some changes to answer the differences in that question. I forget what's different, but you should probably read that one instead. Maybe I should just delete this one.
See also other optimization guides in the x86 tag wiki.
First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also:
gcc -Wall -O3 -march=native fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
long int help;
I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.
Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to do optimize your C source for the CPU microarchitecture, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can do, and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing something.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.
Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
compiler options
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.
vmovupd -0x60(%rbx,%rcx,8),%xmm4`
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. uop throughput shouldn't be an issue, though, since this code is only getting 0.65insns per cycle (and 0.93 uops / cycle), according to perf.
Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and, and find the parallelism to keep those medium latency, high throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standards call it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4-accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare with end with p
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at godbolt. Note I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See https://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfers per cycle, which should keep up with the one 32B FP vector add per cycle.
Loads 32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) IDK why the HW prefetcher can't get ahead after one stall, and then stay ahead. Possibly software prefetch could help? Maybe somehow avoid having the HW prefetcher run past the array, and instead start prefetching the start of the array again. (loop tiling is a much better solution, see below.)
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the L1 on an AMD CPU.
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is a just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
I would try this for the inner loop:
double* tmp = array;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += *tmp; // Use a pointer
tmp++; // because it is faster to increment the pointer
// than it is to recalculate array+j every time
help++;
}
or better
double* tmp = array;
double* end = array + ARRAY_SIZE; // Get rid of variable j by calculating
// the end criteria and
while (tmp != end) { // just compare if the end is reached
sum += *tmp;
tmp++;
help++;
}
I think You should read about openmp library if You could use multithreaded. But this is so simple example that I think could not be optimized.
Certain thing is that You don't need to declare i and j before for loop. That would do:
for (int i = 0; i < N_TIMES; i++)

Effects of Loop unrolling on memory bound data

I have been working with a piece of code which is intensively memory bound. I am trying to optimize it within a single core by manually implementing cache blocking, sw prefetching, loop unrolling etc. Even though cache blocking gives significant improvement in performance. However when i introduce loop unrolling I get tremendous performance degradation.
I am compiling with Intel icc with compiler flags -O2 and -ipo in all my test cases.
My code is similar to this (3D 25-point stencil):
void stencil_baseline (double *V, double *U, int dx, int dy, int dz, double c0, double c1, double c2, double c3, double c4)
{
int i, j, k;
for (k = 4; k < dz-4; k++)
{
for (j = 4; j < dy-4; j++)
{
//x-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] = (c0 * (V[k*dy*dx+j*dx+i]) //center
+ c1 * (V[k*dy*dx+j*dx+(i-1)] + V[k*dy*dx+j*dx+(i+1)])
+ c2 * (V[k*dy*dx+j*dx+(i-2)] + V[k*dy*dx+j*dx+(i+2)])
+ c3 * (V[k*dy*dx+j*dx+(i-3)] + V[k*dy*dx+j*dx+(i+3)])
+ c4 * (V[k*dy*dx+j*dx+(i-4)] + V[k*dy*dx+j*dx+(i+4)]));
}
//y-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] += (c1 * (V[k*dy*dx+(j-1)*dx+i] + V[k*dy*dx+(j+1)*dx+i])
+ c2 * (V[k*dy*dx+(j-2)*dx+i] + V[k*dy*dx+(j+2)*dx+i])
+ c3 * (V[k*dy*dx+(j-3)*dx+i] + V[k*dy*dx+(j+3)*dx+i])
+ c4 * (V[k*dy*dx+(j-4)*dx+i] + V[k*dy*dx+(j+4)*dx+i]));
}
//z-direction
for (i = 4; i < dx-4; i++)
{
U[k*dy*dx+j*dx+i] += (c1 * (V[(k-1)*dy*dx+j*dx+i] + V[(k+1)*dy*dx+j*dx+i])
+ c2 * (V[(k-2)*dy*dx+j*dx+i] + V[(k+2)*dy*dx+j*dx+i])
+ c3 * (V[(k-3)*dy*dx+j*dx+i] + V[(k+3)*dy*dx+j*dx+i])
+ c4 * (V[(k-4)*dy*dx+j*dx+i] + V[(k+4)*dy*dx+j*dx+i]));
}
}
}
}
When I do loop unrolling on the innermost loop (dimension i) and unroll in directions x,y,z separately by unroll factor 2,4,8 respectively, I get performance degradation in all 9 cases i.e. unroll by 2 on direction x, unroll by 2 on direction y, unroll by 2 in direction z, unroll by 4 in direction x ... etc.
But when I perform loop unrolling on the outermost loop (dimension k) by factor of 8 (2 & 4 also), I get v.good performance improvement which is even better than cache blocking.
I even tried profiling my code with Intel Vtune. It seemed like the bottlenecks where mainly due to 1.LLC Miss and 2. LLC Load Misses serviced by Remote DRAM.
I am unable to understand why unrolling the innermost fastest loop in giving performance degradation whereas unrolling the outermost, slowest dimension is fetching performance improvement. However, this improvement in the latter case is when i use -O2 and -ipo when compiling with icc.
I am not sure how to interpret these statistics. Can someone help shed some light on this.
This strongly suggests that you are causing instruction cache misses by the unrolling, which is typical. In the age of modern hardware, unrolling no longer automatically means faster code. If each inner loop fits in a cache line, you will get better performance.
You may be able to unroll manually, to limit the size of the generated code, but this will require examining the generated machine-language instructions -- and their position -- to ensure that your loop is within a single cache line. Cache lines are typically 64 bytes long, and aligned on 64-byte boundaries.
Outer loops do not have the same effect. They will likely be outside of the instruction cache regardless of the unroll level. Unrolling these results in fewer branches, which is why you get better performance.
"Load misses serviced by remote DRAM" means that you allocated memory on one NUMA node, but now you are running on the other. Setting process or thread affinity based on NUMA is the answer.
Remote DRAM takes almost twice as long to read as local DRAM on the Intel machines that I have used.

Resources