how to properly do multiply accumulate with NEON intrinsics - c

i need to do a simple multiply accumulate of two signed 8 bit arrays.
This routine runs every millisecond on an ARM7 embedded device. I am trying to speed it up a bit. I have already tried optimizing and enabling vector ops.
-mtune=cortex-a15.cortex-a7 -mfpu=neon-vfpv4 -ftree-vectorize -ffast-math -mfloat-abi=hard
this helped but I am still running close to the edge.
this is the 'c' code.
for(i = 4095; i >= 0; --i)
{
accum += arr1[i]*arr2[i];
}
I am trying to use NEON intrinsics. This loop runs ~5 times faster, but I get different results. I am pretty sure I am not properly retrieving the accumulation, or it rolls over before I do. Any help/pointers is greatly appreciated. Any detailed docs would also be helpful.
for(int i = 256; i > 0; --i)
{
int8x16_t vec16a = vld1q_s8(&arr1[index]);
int8x16_t vec16b = vld1q_s8(&arr2[index]);
vec16res = vmlaq_s8(vec16res, vec16a, vec16b);
index+=16;
}
EDIT to post solution.
Thanks to tips from all!
I dropped to to 8x8 and have a fast solution
using the below code I achieved a "fast enough" time. Not as fast as the 128bit version but good enough.
I added __builtin_prefetch() for the data, and did a 10 pass avg.
Neon is substantially faster.
$ ./test 10
original code time ~ 30392nS
optimized C time ~ 8458nS
NEON elapsed time ~ 3199nS
int32_t sum = 0;
int16x8_t vecSum = vdupq_n_s16(0);
int8x8_t vec8a;
int8x8_t vec8b;
int32x4_t sum32x4;
int32x2_t sum32x2;
#pragma unroll
for (i = 512; i > 0; --i)
{
vec8a = vld1_s8(&A[index]);
vec8b = vld1_s8(&B[index]);
vecSum = vmlal_s8(vecSum,vec8a,vec8b);
index += 8;
}
sum32x4 = vaddl_s16(vget_high_s16(vecSum),vget_low_s16(vecSum));
sum32x2 = vadd_s32(vget_high_s32(sum32x4),vget_low_s32(sum32x4));
sum += vget_lane_s32(vpadd_s32(sum32x2,sum32x2),0);

Your issue is likely overflow, so you'll need to lengthen when you do your multiply-accumulate.
As you're on ARMv7, you'll want vmlal_s8.
ARMv8 A64 has vmlal_high_s8 which allows you to stay in 128-bit vectors, which will give an added speed-up.
As mentioned in comments, seeing what auto-vectorization will do with -O options / pragma unroll is very valuable, and learning from the godbolt of that. Unrolling often gives speed-ups when doing by hand as well.
Lots more valuable tips on optimization in the Arm Neon resources.

Related

C, Way to make multiply in array element fast

import numpy as np
array = np.random.rand(16384)
array *= 3
above python code make each element in array has 3 times multiplied value of its own.
On my Laptop, these code took 5ms
Below code is what i tried on C language.
#include <headers...>
array = make 16384 elements...;
for(int i = 0 ; i < 16384 ; ++i)
array[i] *= 3
compile command was
gcc -O2 main.cpp
it takes almost 30ms.
Is there any way i can reduce process time of this?
P.S it was my fault. I confused unit of timestamp value.
this code is faster than numpy. sorry for this question.
This sounds pretty unbelievable. For reference, I wrote a trivial (but complete) program that does roughly what you seem to be describing. I used C++ so I could use its chrono library to get (much) more precise timing than C's clock provides, but I wouldn't expect that to affect the speed at all.
#include <iostream>
#include <chrono>
#define SIZE (16384)
float array[SIZE];
int main() {
using namespace std::chrono;
for (int i = 0; i < SIZE; i++) {
array[i] = i;
}
auto start = high_resolution_clock::now();
for (int i=0; i<SIZE; i++) {
array[i] *= 3.0;
}
auto stop = high_resolution_clock::now();
std::cout << duration_cast<microseconds>(stop - start).count() << '\n';
long total = 0;
for (int i = 0; i < SIZE; i++) {
total += i;
}
std::cout << "Ignore: " << total << "\n";
}
On my machine (2.8 GHz Haswell, so probably slower than whatever you're running) this shows a time of 7 or 8 microseconds, so around 600-700 times as fast as you're getting from Python.
Adding the compiler flag to use AVX 2 instructions reduces that to 4 microseconds, or a little more than 1000 times as fast (warning: AMD processors generally don't get as much of a speed boost from using AVX 2, but if you have a reasonably new AMD processor I'd expect it to be faster than this anyway).
Bottom line: the speed you're reporting for your C code only seems to make sense if you're running the code on some sort of slow microcontroller, or maybe a really old desktop system--though it would have to be quite old to run nearly as slow as you're reporting. My immediate guess is that even a 386 would be faster than that.
When/if you have something that takes enough time to justify it, you can also use OpenMP to run a loop like this in multiple threads. I tried that, but in this case the overhead of starting up and synchronizing the threads is (quite a bit) more than running in parallel can gain, so it's a net loss.
Compiler: VS 2019 (Microsoft (R) C/C++ Optimizing Compiler Version 19.27.28919.3 for x64).
Flags: /O2b2 /GL (and part of the time, /arch:AVX2)

SSE intrinsics without compiler optimization

I am new to SSE intrinsics and try to optimise my code by it. Here is my program about counting array elements which are equal to the given value.
I changed my code to SSE version but the speed almost doesn't change. I am wondering whether I use SSE in a wrong way...
This code is for an assignment where we're not allowed to enable compiler optimization options.
No SSE version:
int get_freq(const float* matrix, float value) {
int freq = 0;
for (ssize_t i = start; i < end; i++) {
if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
freq++;
}
}
return freq;
}
SSE version:
#include <immintrin.h>
#include <math.h>
#include <float.h>
#define GETLOAD(n) __m128 load##n = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n) __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)
int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {
int freq = 0;
int i;
__m128 value = _mm_set1_ps(givenValue);
__m128 count = _mm_setzero_ps();
__m128 and_value = _mm_set1_ps(0x00000001);
for (i = 0; i + 15 < g_elements; i += 16) {
GETLOAD(0); GETLOAD(1); GETLOAD(2); GETLOAD(3);
GETEQU(0); GETEQU(1); GETEQU(2); GETEQU(3);
GETCOUNT(0);GETCOUNT(1);GETCOUNT(2);GETCOUNT(3);
}
__m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
count = _mm_add_ps(count, shuffle_a);
__m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
count = _mm_add_ps(count, shuffle_b);
freq = _mm_cvtss_si32(count);
for (; i < g_elements; i++) {
if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
freq++;
}
}
return freq;
}
If you need to compile with -O0, then do as much as possible in a single statement. In normal code, int a=foo(); bar(a); will compile to the same asm as bar(foo()), but in -O0 code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.
-O0 is designed to give the most predictable results from debugging, which is why everything is stored to memory after every statement. This is obviously horrible for performance.
I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for -O0. Some of that may help.
Don't try too hard on this assignment. Probably most of the "tricks" that you figure out that make your code run faster with -O0 will only matter for -O0, but make no difference with optimization enabled.
In real life, code is typically compiled with clang or gcc -O2 at least, and sometimes -O3 -march=haswell or whatever to auto-vectorize. (Once it's debugged and you're ready to optimize.)
Re: your update:
Now it compiles, and the horrible asm from the SSE version can be seen. I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline functions still having args and return values that result in actual load/store round trips (store-forwarding latency) even with __attribute__((always_inline)). See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.
The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the x86 tag wiki for optimization resources.)
BTW, a vectorized fabsf() is easy, see Fastest way to compute absolute value using SSE. That and an SSE compare for less-than should do the trick to give you the same semantics as your scalar code. (But makes it even harder to get -O0 to not suck).
You might do better just manually unrolling your scalar version one or two times, because -O0 sucks too much.
Some compilers are pretty good about doing optimization of vectors. Did you check the generated assembly of optimized build of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

How to optimize these loops (with compiler optimization disabled)?

I need to optimize some for-loops for speed (for a school assignment) without using compiler optimization flags.
Given a specific Linux server (owned by the school), a satisfactory improvement is to make it run under 7 seconds, and a great improvement is to make it run under 5 seconds. This code that I have right here gets about 5.6 seconds. I am thinking I may need to use pointers with this in some way to get it to go faster, but I'm not really sure. What options do I have?
The file must remain 50 lines or less (not counting comments).
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
register double sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0, sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0;
register int j;
// ... and this one.
printf("CS201 - Asgmt 4 - \n");
for (i = 0; i < N_TIMES; i++)
{
// You can change anything between this comment ...
for (j = 0; j < ARRAY_SIZE; j += 10)
{
sum += array[j];
sum1 += array[j + 1];
sum2 += array[j + 2];
sum3 += array[j + 3];
sum4 += array[j + 4];
sum5 += array[j + 5];
sum6 += array[j + 6];
sum7 += array[j + 7];
sum8 += array[j + 8];
sum9 += array[j + 9];
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
sum += sum1 + sum2 + sum3 + sum4 + sum5 + sum6 + sum7 + sum8 + sum9;
// ... and this one.
return 0;
}
Re-posting a modified version of my answer from optimized sum of an array of doubles in C, since that question got voted down to -5. The OP of the other question phrased it more as "what else is possible", so I took him at his word and info-dumped about vectorizing and tuning for current CPU hardware. :)
The OP of that question eventually said he wasn't allowed to use compiler options higher than -O0, which I guess is the case here, too.
Summary:
Why using -O0 distorts things (unfairly penalizes things that are fine in normal code for a normal compiler). Using -O0 (the gcc/clang default) so your loops don't optimize away is not a valid excuse or a useful way to find out what will be faster with normal optimization enabled. (See also Idiomatic way of performance evaluation? for more about benchmark methods and pitfalls, like ways to enable optimization but still stop the compiler from optimizing away the work you want to measure.)
Stuff that's wrong with the assignment.
Types of optimizations. FP latency vs. throughput, and dependency chains. Link to Agner Fog's site. (Essential reading for optimization).
Experiments getting the compiler to optimize it (after fixing it to not optimize away). Best result with auto-vectorization (no source changes): gcc: half as fast as an optimal vectorized loop. clang: same speed as a hand-vectorized loop.
Some more comments on why bigger expressions are a perf win with -O0 only.
Source changes to get good performance without -ffast-math, making the code closer to what we want the compiler to do. Also some rules-lawyering ideas that would be useless in the real-world.
Vectorizing the loop with GCC architecture-neutral vectors, to see how close the auto-vectorizing compilers came to matching the performance of ideal asm code (since I checked the compiler output).
I think the point of the assignment is to sort of teach assembly-language performance optimizations using C with no compiler optimizations. This is silly. It's mixing up things the compiler will do for you in real life with things that do require source-level changes.
See Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?
-O0 doesn't just "not optimize", it makes the compiler store variables to memory after every statement instead of keeping them in registers. It does this so you get the "expected" results if you set a breakpoint with gdb and modify the value (in memory) of a C variable. Or even if you jump to another line in the same function. So each C statement has to be compiled to an independent block of asm that starts and ends with all variables in memory. For a modern portable compiler like gcc which already transforms through multiple internal representations of program flow on the way from source to asm, this part of -O0 requires explicitly de-optimizing its graph of data flow back into separate C statements. These store/reloads lengthen every loop-carried dependency chain so it's horrible for tiny loops if the loop counter is kept in memory. (e.g. 1 cycle per iteration for inc reg vs. 6c for inc [mem], creating a bottleneck on loop counter updates in tight loops).
With gcc -O0, the register keyword lets gcc keep a var in a register instead of memory, and thus can make a big difference in tight loops (Example on the Godbolt Compiler explorer). But that's only with -O0. In real code, register is meaningless: the compiler attempts to optimally use the available registers for variables and temporaries. register is already deprecated in ISO C++11 (but not C11), and there's a proposal to remove it from the language along with other obsolete stuff like trigraphs.
With an extra variables involved, -O0 hurts array indexing a bit more than pointer incrementing.
Array indexing usually makes code easier to read. Compilers sometimes fail to optimize stuff like array[i*width + j*width*height], so it's a good idea to change the source to do the strength-reduction optimization of turning the multiplies into += adds.
At an asm level, array indexing vs. pointer incrementing are close to the same performance. (x86 for example has addressing modes like [rsi + rdx*4] which are as fast as [rdi]. except on Sandybridge and later.) It's the compiler's job to optimize your code by using pointer incrementing even when the source uses array indexing, when that's faster.
For good performance, you have to be aware of what compilers can and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing an optimization that was essential for some code to run fast. (e.g. pulling a constant computation out of a loop, or proving something about how different branch conditions are related to each other, and simplifying.)
Besides all that, it's a crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
(You can fix this by printing sum at the end. gcc and clang don't seem to realize that calloc returns zeroed memory, and optimize it away to 0.0. See my code below.)
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also, the other version of this question had an uninitialized variable kicking around. It looks like long int help was introduced by the OP of that question, not the prof. So I will have to downgrade my "utter nonsense" to merely "silly", because the code doesn't even print the result at the end. That's the most common way of getting the compiler not to optimize everything away in a microbenchmark like this.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. Usually the difference in rounding error is small, so the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it.
Instead of just unrolling, keep multiple accumulators which you only add up at the end, like you're doing with the sum0..sum9 unroll-by-10. FP instructions have medium latency but high throughput, so you need to keep multiple FP operations in flight to keep the floating point execution units saturated.
If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
Compiler Options
Lets start by seeing what the compiler can do for us.
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win: it collects the sums every time, instead of giving each thread 1/4 of the outer loop iterations.
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s. profile-guided optimization is a good idea when you can exercise all the relevant code-paths, so the compiler can make better unrolling / inlining decisions.
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang 3.5 is too old to support -march=sandybridge. You should prefer to use a compiler version that's new enough to know about the target architecture you're tuning for, esp. if using -march to make code that doesn't need to run on older architectures.)
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. It knows that calloc only returns 16-byte aligned memory (on x86-64 System V), and when tuning for Sandybridge (before Haswell) it knows that 32-byte loads have a big penalty when misaligned. And that splitting them isn't too expensive since a 32-byte load takes 2 cycles in a load port anyway.
vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
This is worse on later CPUs, especially when the data does happen to be aligned at run-time; see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd? about GCC versions where -mavx256-split-unaligned-load was on by default with -mtune=generic.
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) In that case it uses a tight loop of 4x vaddpd mem, %ymm, %ymm. It only runs about 0.65 insns per cycle (and 0.93 uops / cycle), according to perf, so the bottleneck isn't front-end.
I checked with a debugger, and calloc is indeed returning a pointer that's an odd multiple of 16. (glibc for large allocations tends to allocate new pages, and put bookkeeping info in the initial bytes, always misaligning to any boundary wider than 16.) So half the 32B memory accesses are crossing a cache line, causing a big slowdown. It is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. (gcc enables -mavx256-split-unaligned-load and ...-store for -march=sandybridge, and also for the default tune=generic with -mavx, which is not so good especially for Haswell or with memory that's usually aligned by the compiler doesn't know about it.)
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your (from the other question) source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and, and find the parallelism to keep those medium latency, high throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4 accumulator variables and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
GCC, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare with end with p
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at the godbolt compiler explorer. The -xc compiler option compiles as C, not C++. The inner loop is from .L3 to jne .L3. See the x86 tag wiki for x86 asm links. See also this q&a about micro-fusion not happening on SnB-family, which Agner Fog's guides don't cover).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfers per cycle, which should keep up with the one 32B FP vector add per cycle.
32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 bandwidth was the problem after all. There aren't enough line-fill buffers to keep enough misses in flight to sustain the peak throughput every cycle. L2 sustained bandwidth is less than peak on Intel SnB / Haswell / Skylake CPUs.
See also Single Threaded Memory Bandwidth on Sandy Bridge (Intel forum thread, with much discussion about what limits throughput, and how latency * max_concurrency is one possible bottleneck. See also the "Latency Bound Platforms" part of the answer to Enhanced REP MOVSB for memcpy limited memory concurrency is a bottleneck for loads as well as stores, but for loads prefetch into L2 does mean you might not be limited purely by Line Fill buffers for outstanding L1D misses.
Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) Loop tiling is a much better solution, see below.
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the 64kiB L1D on an AMD K10 (Istanbul) CPU, but not Bulldozer-family (16kiB L1D) or Ryzen (32kiB L1D).
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is a just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
You may be on the right track, though you'll need to measure it to be certain (my normal advice to measure, not guess seems a little superfluous here since the whole point of the assignment is to measure).
Optimising compilers will probably not see much of a difference since they're pretty clever about that sort of stuff but, since we don't know what optimisation level it will be compiling at, you may get a substantial improvement.
To use pointers in the inner loop is a simple matter of first adding a pointer variable:
register double *pj;
then changing the loop to:
for (pj = &(array[0]); pj < &(array[ARRAY_SIZE]); j++) {
sum += *j++;
sum1 += *j++;
sum2 += *j++;
sum3 += *j++;
sum4 += *j++;
sum5 += *j++;
sum6 += *j++;
sum7 += *j++;
sum8 += *j++;
sum9 += *j;
}
This keeps the amount of additions the same within the loop (assuming you're counting += and ++ as addition operators, of course) but basically uses pointers rather than array indexes.
With no optimisation1 on my system, this drops it from 9.868 seconds (CPU time) to 4.84 seconds. Your mileage may vary.
1 With optimisation level -O3, both are reported as taking 0.001 seconds so, as mentioned, the optimisers are pretty clever. However, given you're seeing 5+ seconds, I'd suggest it wasn't been compiled with optimisation on.
As an aside, this is a good reason why it's usually advisable to write your code in a readable manner and let the compiler take care of getting it running faster. While my meager attempts at optimisation roughly doubled the speed, using -O3 made it run some ten thousand times faster :-)
Before anything else, try to change compiler settings to produce faster code. There is general optimisation, and the compiler might do auto vectorisation.
What you would always do is try several approaches and check what is fastest. As a target, try to get to one cycle per addition or better.
Number of iterations per loop: You add up 10 sums simultaneously. It might be that your processor doesn't have enough registers for that, or it has more. I'd measure the time for 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... sums per loop.
Number of sums: Having more than one sum means that latency doesn't bite you, just throughput. But more than four or six might not be helpful. Try four sums, with 4, 8, 12, 16 iterations per loop. Or six sums, with 6, 12, 18 iterations.
Caching: You are running through an array of 80,000 bytes. Probably more than L1 cache. Split the array into 2 or 4 parts. Do an outer loop iterating over the two or four subarrays, the next loop from 0 to N_TIMES - 1, and the inner loop adding up values.
And then you can try using vector operations, or multi-threading your code, or using the GPU to do the work.
And if you are forced to use no optimisation, then the "register" keyword might actually work.

optimized sum of an array of doubles in C [duplicate]

This question already has answers here:
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
Closed 5 years ago.
I've got an assignment where I must take a program and make it more efficient in terms of time.
the original code is:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
help++;
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// ... and this one.
return 0;
}
I almost exclusively modified the second for loop by changing it to
double *j=array;
double *p=array+ARRAY_SIZE;
for(; j<p;j+=10){
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
{
this on its own was able to reduce the time down to the criteria...
it already seems to work but are there any mistakes i'm not seeing?
I posted an improved version of this answer on a duplicate of this: C loop optimization help for final assignment. It was originally just a repost, but then I made some changes to answer the differences in that question. I forget what's different, but you should probably read that one instead. Maybe I should just delete this one.
See also other optimization guides in the x86 tag wiki.
First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also:
gcc -Wall -O3 -march=native fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
long int help;
I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.
Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to do optimize your C source for the CPU microarchitecture, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can do, and can't do. Some optimizations are "brittle", and a small seemingly-innocent change to the source will stop the compiler from doing something.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.
Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
compiler options
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.
vmovupd -0x60(%rbx,%rcx,8),%xmm4`
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. uop throughput shouldn't be an issue, though, since this code is only getting 0.65insns per cycle (and 0.93 uops / cycle), according to perf.
Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and, and find the parallelism to keep those medium latency, high throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standards call it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4-accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89 for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare with end with p
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at godbolt. Note I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See https://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfers per cycle, which should keep up with the one 32B FP vector add per cycle.
Loads 32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) IDK why the HW prefetcher can't get ahead after one stall, and then stay ahead. Possibly software prefetch could help? Maybe somehow avoid having the HW prefetcher run past the array, and instead start prefetching the start of the array again. (loop tiling is a much better solution, see below.)
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the L1 on an AMD CPU.
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is a just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
I would try this for the inner loop:
double* tmp = array;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += *tmp; // Use a pointer
tmp++; // because it is faster to increment the pointer
// than it is to recalculate array+j every time
help++;
}
or better
double* tmp = array;
double* end = array + ARRAY_SIZE; // Get rid of variable j by calculating
// the end criteria and
while (tmp != end) { // just compare if the end is reached
sum += *tmp;
tmp++;
help++;
}
I think You should read about openmp library if You could use multithreaded. But this is so simple example that I think could not be optimized.
Certain thing is that You don't need to declare i and j before for loop. That would do:
for (int i = 0; i < N_TIMES; i++)

why does GCC __builtin_prefetch not improve performance?

I'm writing a program to analyze a graph of social network. It means the program needs a lot of random memory accesses. It seems to me prefetch should help. Here is a small piece of the code of reading values from neighbors of a vertex.
for (size_t i = 0; i < v.get_num_edges(); i++) {
unsigned int id = v.neighbors[i];
res += neigh_vals[id];
}
I transform the code above to the one as below and prefetch the values of the neighbors of a vertex.
int *neigh_vals = new int[num_vertices];
for (size_t i = 0; i < v.get_num_edges(); i += 128) {
size_t this_end = std::min(v.get_num_edges(), i + 128);
for (size_t j = i; j < this_end; j++) {
unsigned int id = v.neighbors[j];
__builtin_prefetch(&neigh_vals[id], 0, 2);
}
for (size_t j = i; j < this_end; j++) {
unsigned int id = v.neighbors[j];
res += neigh_vals[id];
}
}
In this C++ code, I didn't override any operators.
Unfortunately, the code doesn't really improve the performance. I wonder why. Apparently, hardware prefetch doesn't work in this case because the hardware can't predict the memory location.
I wonder if it's caused by GCC optimization. When I compile the code, I enable -O3. I really hope prefetch can further improve performance even when -O3 is enabled. Does -O3 optimization fuse the two loops in this case? Can -O3 enable prefetch in this case by default?
I use gcc version 4.6.3 and the program runs on Intel Xeon E5-4620.
Thanks,
Da
Yes, some recent versions of GCC (e.g. 4.9 in march 2015) are able to issue some PREFETCH instruction when optimizing with -O3 (even without any explicit __builtin_prefetch)
We don't know what get_neighbor is doing, and what are the types of v and neigh_val.
And prefetching is not always profitable. Adding explicit __builtin_prefetch can slow down your code. You need to measure.
As Retired Ninja commented, prefetching in one loop and hoping data would be cached in the following loop (further down in your source code) is wrong.
You might perhaps try instead
for (size_t i = 0; i < v.get_num_edges(); i++) {
fg::vertex_id_t id = v.get_neighbor(i);
__builtin_prefetch (neigh_val[v.get_neighbor(i+4)]);
res += neigh_vals[id];
}
You could empirically replace the 4 with whatever appropriate constant is the best.
But I guess that the __builtin_prefetch above is useless (since the compiler is probably able to add it by itself) and it could harm (or even crash the program, when computing its argument gives undefined behavior, e.g. if v.get_neighbor(i+4) is undefined; however prefetching an address outside of your address space won't harm -but could slow down your program). Please benchmark.
See this answer to a related question.
Notice that in C++ all of [], get_neighbor could be overloaded and becomes very complex operations, so we cannot guess!
And there are cases where the hardware is limiting performance, whatever __builtin_prefetch you add (and adding them could hurt performance)
BTW, you might pass -O3 -mtune=native -fdump-tree-ssa -S -fverbose-asm to understand more what the compiler is doing (and look inside generated dump files and assembler files); also, it does happen that -O3 produces slightly slower code than what -O2 gives.
You could consider explicit multithreading, OpenMP, OpenCL if you have time to waste on optimization. Remember that premature optimization is evil. Did you benchmark, did you profile your entire application?

Resources