SSE intrinsics without compiler optimization - c

I am new to SSE intrinsics and am trying to use them to optimise my code. The program counts how many array elements are equal to a given value.
I changed my code to an SSE version, but the speed barely changes. Am I using SSE the wrong way?
This code is for an assignment where we're not allowed to enable compiler optimization options.
No SSE version:
int get_freq(const float* matrix, float value, ssize_t start, ssize_t end) {
int freq = 0;
for (ssize_t i = start; i < end; i++) {
if (fabsf(matrix[i] - value) <= FLT_EPSILON) {
freq++;
}
}
return freq;
}
SSE version:
#include <immintrin.h>
#include <math.h>
#include <float.h>
#include <sys/types.h> // for ssize_t
#define GETLOAD(n) __m128 load##n = _mm_load_ps(&matrix[i + 4 * n])
#define GETEQU(n) __m128 check##n = _mm_and_ps(_mm_cmpeq_ps(load##n, value), and_value)
#define GETCOUNT(n) count = _mm_add_ps(count, check##n)
int get_freq(const float* matrix, float givenValue, ssize_t g_elements) {
int freq = 0;
int i;
__m128 value = _mm_set1_ps(givenValue);
__m128 count = _mm_setzero_ps();
__m128 and_value = _mm_set1_ps(0x00000001);
for (i = 0; i + 15 < g_elements; i += 16) {
GETLOAD(0); GETLOAD(1); GETLOAD(2); GETLOAD(3);
GETEQU(0); GETEQU(1); GETEQU(2); GETEQU(3);
GETCOUNT(0);GETCOUNT(1);GETCOUNT(2);GETCOUNT(3);
}
__m128 shuffle_a = _mm_shuffle_ps(count, count, _MM_SHUFFLE(1, 0, 3, 2));
count = _mm_add_ps(count, shuffle_a);
__m128 shuffle_b = _mm_shuffle_ps(count, count, _MM_SHUFFLE(2, 3, 0, 1));
count = _mm_add_ps(count, shuffle_b);
freq = _mm_cvtss_si32(count);
for (; i < g_elements; i++) {
if (fabsf(matrix[i] - givenValue) <= FLT_EPSILON) {
freq++;
}
}
return freq;
}

If you need to compile with -O0, then do as much as possible in a single statement. In normal code, int a=foo(); bar(a); will compile to the same asm as bar(foo()), but in -O0 code, the second version will probably be faster, because it doesn't store the result to memory and then reload it for the next statement.
-O0 is designed to give the most predictable results from debugging, which is why everything is stored to memory after every statement. This is obviously horrible for performance.
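For instance (my own illustration, reusing the names from the question's scalar version), the whole loop body can be one statement, so no named temporary gets stored and reloaded between statements at -O0:
int freq = 0;
for (ssize_t i = start; i < end; i++) {
    freq += (fabsf(matrix[i] - value) <= FLT_EPSILON);  // the comparison yields 0 or 1
}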
I wrote a big answer a while ago for a different question from someone else with a stupid assignment like yours that required them to optimize for -O0. Some of that may help.
Don't try too hard on this assignment. Probably most of the "tricks" that you figure out that make your code run faster with -O0 will only matter for -O0, but make no difference with optimization enabled.
In real life, code is typically compiled with clang or gcc -O2 at least, and sometimes -O3 -march=haswell or whatever to auto-vectorize. (Once it's debugged and you're ready to optimize.)
Re: your update:
Now it compiles, and the horrible asm from the SSE version can be seen. I put it on godbolt along with a version of the scalar code that actually compiles, too. Intrinsics usually compile very badly with optimization disabled, with the inline functions still having args and return values that result in actual load/store round trips (store-forwarding latency) even with __attribute__((always_inline)). See Demonstrator code failing to show 4 times faster SIMD speed with optimization disabled for example.
The scalar version comes out a lot less bad. Its source does everything in one expression, so temporaries stay in registers. The loop counter is still in memory, though, bottlenecking it to at best one iteration per 6 cycles on Haswell, for example. (See the x86 tag wiki for optimization resources.)
BTW, a vectorized fabsf() is easy, see Fastest way to compute absolute value using SSE. That and an SSE compare for less-than should do the trick to give you the same semantics as your scalar code. (But makes it even harder to get -O0 to not suck).
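For illustration, here is a minimal sketch of that combination (my own code, not part of the original answer; the function name is made up). It keeps the FLT_EPSILON semantics of the scalar version and counts matches by subtracting the all-ones compare mask from an integer accumulator:
#include <immintrin.h>
#include <math.h>
#include <float.h>
#include <sys/types.h>
int get_freq_eps(const float *matrix, float givenValue, ssize_t g_elements) {
    const __m128 value   = _mm_set1_ps(givenValue);
    const __m128 eps     = _mm_set1_ps(FLT_EPSILON);
    const __m128 absmask = _mm_castsi128_ps(_mm_set1_epi32(0x7FFFFFFF)); // clears the sign bit
    __m128i counts = _mm_setzero_si128();
    ssize_t i = 0;
    for (; i + 3 < g_elements; i += 4) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(&matrix[i]), value);
        __m128 m    = _mm_cmple_ps(_mm_and_ps(diff, absmask), eps); // all-ones where |x - value| <= eps
        counts = _mm_sub_epi32(counts, _mm_castps_si128(m));        // subtracting -1 adds 1 per match
    }
    int lane[4];
    _mm_storeu_si128((__m128i *)lane, counts);   // horizontal sum of the 4 lane counters
    int freq = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < g_elements; i++)                  // scalar cleanup for the tail
        freq += (fabsf(matrix[i] - givenValue) <= FLT_EPSILON);
    return freq;
}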
You might do better just manually unrolling your scalar version one or two times, because -O0 sucks too much.
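E.g. something like this (my sketch), combining the unroll with the single-statement idea above:
ssize_t i;
for (i = start; i + 3 < end; i += 4) {   // unrolled by 4
    freq += (fabsf(matrix[i]     - value) <= FLT_EPSILON)
          + (fabsf(matrix[i + 1] - value) <= FLT_EPSILON)
          + (fabsf(matrix[i + 2] - value) <= FLT_EPSILON)
          + (fabsf(matrix[i + 3] - value) <= FLT_EPSILON);
}
for (; i < end; i++)                     // cleanup for the last few elements
    freq += (fabsf(matrix[i] - value) <= FLT_EPSILON);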

Some compilers are pretty good at auto-vectorizing. Did you check the generated assembly of optimized builds of both versions? Isn't the "naive" version actually using SIMD or other optimization techniques?

Related

how to properly do multiply accumulate with NEON intrinsics

I need to do a simple multiply-accumulate of two signed 8-bit arrays.
This routine runs every millisecond on an ARM7 embedded device. I am trying to speed it up a bit. I have already tried optimizing and enabling vector ops.
-mtune=cortex-a15.cortex-a7 -mfpu=neon-vfpv4 -ftree-vectorize -ffast-math -mfloat-abi=hard
This helped, but I am still running close to the edge.
This is the C code:
for(i = 4095; i >= 0; --i)
{
accum += arr1[i]*arr2[i];
}
I am trying to use NEON intrinsics. This loop runs ~5 times faster, but I get different results. I am pretty sure I am not properly retrieving the accumulation, or it rolls over before I do. Any help/pointers are greatly appreciated. Any detailed docs would also be helpful.
for(int i = 256; i > 0; --i)
{
int8x16_t vec16a = vld1q_s8(&arr1[index]);
int8x16_t vec16b = vld1q_s8(&arr2[index]);
vec16res = vmlaq_s8(vec16res, vec16a, vec16b);
index+=16;
}
EDIT to post solution.
Thanks to tips from all!
I dropped down to 8x8 and have a fast solution.
Using the code below I achieved a "fast enough" time. Not as fast as the 128-bit version, but good enough.
I added __builtin_prefetch() for the data, and did a 10-pass average.
Neon is substantially faster.
$ ./test 10
original code time ~ 30392nS
optimized C time ~ 8458nS
NEON elapsed time ~ 3199nS
int32_t sum = 0;
int16x8_t vecSum = vdupq_n_s16(0);
int8x8_t vec8a;
int8x8_t vec8b;
int32x4_t sum32x4;
int32x2_t sum32x2;
#pragma unroll
for (i = 512; i > 0; --i)
{
vec8a = vld1_s8(&A[index]);
vec8b = vld1_s8(&B[index]);
vecSum = vmlal_s8(vecSum,vec8a,vec8b);
index += 8;
}
sum32x4 = vaddl_s16(vget_high_s16(vecSum),vget_low_s16(vecSum));
sum32x2 = vadd_s32(vget_high_s32(sum32x4),vget_low_s32(sum32x4));
sum += vget_lane_s32(vpadd_s32(sum32x2,sum32x2),0);
Your issue is likely overflow, so you'll need to lengthen when you do your multiply-accumulate.
As you're on ARMv7, you'll want vmlal_s8.
ARMv8 A64 has vmlal_high_s8 which allows you to stay in 128-bit vectors, which will give an added speed-up.
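For example, a sketch of what that might look like (my own illustration, assuming an AArch64 target and the 4096-element arrays from the question; like the poster's 8x8 solution it accumulates into 16-bit lanes, which assumes the per-lane sums stay within int16_t range for the actual data):
#include <arm_neon.h>
#include <stdint.h>
int32_t dot_s8_a64(const int8_t *arr1, const int8_t *arr2) {
    int16x8_t acc_lo = vdupq_n_s16(0);
    int16x8_t acc_hi = vdupq_n_s16(0);
    for (int i = 0; i < 4096; i += 16) {
        int8x16_t a = vld1q_s8(&arr1[i]);
        int8x16_t b = vld1q_s8(&arr2[i]);
        acc_lo = vmlal_s8(acc_lo, vget_low_s8(a), vget_low_s8(b)); // widen low 8 lanes
        acc_hi = vmlal_high_s8(acc_hi, a, b);                      // widen high 8 lanes
    }
    // A64 widening horizontal adds, then combine the two halves
    return vaddlvq_s16(acc_lo) + vaddlvq_s16(acc_hi);
}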
As mentioned in comments, seeing what auto-vectorization does with -O options / #pragma unroll is very valuable, as is studying the Godbolt output for it. Unrolling by hand often gives speed-ups as well.
Lots more valuable tips on optimization in the Arm Neon resources.

How can I effectively time the execution of a function that's only a few cycles long?

I'm trying to do some comparisons on different methods for calculating dot products using SSE Intrinsics, but since the methods are only a few cycles long, I have to run the instructions trillions of times for it to take more than a tiny fraction of a second. The only problem with that is that gcc with the -O3 flag is "optimizing" my main method into an infinite loop.
My code is
#include <immintrin.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <inttypes.h>
#define NORMAL 0
struct _Vec3 {
float x;
float y;
float z;
float w;
};
typedef struct _Vec3 Vec3;
__m128 singleDot(__m128 a, __m128 b) {
return _mm_dp_ps(a, b, 0b00001111);
}
int main(int argc, char** argv) {
for (uint16_t j = 0; j < (1L << 16); j++) {
for (uint64_t i = 0; i < (1L << 62); i++) {
Vec3 a = {i, i + 0.5, i + 1, 0.0};
Vec3 b = {i, i - 0.5, i - 1, 0.0};
#if NORMAL
float ans = normalDot(a, b); // naive implementation
#else
// float _c[4] = {a.x, a.y, a.z, 0.0};
// float _d[4] = {b.x, b.y, b.z, 0.0};
__m128 c = _mm_load_ps((float*)&a);
__m128 d = _mm_load_ps((float*)&b);
__m128 ans = singleDot(c, d);
#endif
}
}
}
but when I compile with gcc -std=c11 -march=native -O3 main.c and run objdump -d, it turns main into
0000000000400400 <main>:
400400: eb fe jmp 400400 <main>
is there an alternative for timing different approaches?
That's because this:
for (uint16_t j = 0; j < (1L << 16); j++) {
is an infinite loop -- the maximum value for a uint16_t is 65535 (2^16 - 1), after which it will wrap back to 0. So the test will always be true.
Even after fixing the uint16_t instead of uint64_t typo that makes your loop infinite, the actual work would still be optimized away because nothing uses the result.
You can use Google Benchmark's DoNotOptimize to stop your unused ans result from being optimized away. e.g. functions like "Escape" and "Clobber" that this Q&A is asking about. That works in GCC, and that question links to a relevant youtube video from a clang developer's CppCon talk.
Another worse way is to assign the result to a volatile variable. But keep in mind that common-subexpression elimination can still optimize away earlier parts of the calculation, whether you use volatile or an inline-asm macro to make sure the compiler materializes the actual final result somewhere. Micro-benchmarking is hard. You need the compiler to do exactly the amount of work that would happen in the real use-case, but not more.
See Idiomatic way of performance evaluation? for that and more.
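As an illustration (my sketch of the usual empty-asm "escape" trick, not code from the question), something like this keeps the result live without adding real work:
static inline void escape(void *p) {
    // Pretend the object behind p is read and written in some unknowable way,
    // so the compiler must actually materialize the value it points to.
    __asm__ volatile("" : : "g"(p) : "memory");
}
// ... inside the benchmark loop:
//     __m128 ans = singleDot(c, d);
//     escape(&ans);   // stops -O3 from deleting the dpps as dead code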
Keep in mind exactly what you're measuring here.
Probably a bunch of loop overhead, and possibly store-forwarding stalls, depending on whether the compiler vectorizes those initializers or not; but even if it does, conversion of integer to FP and 2x SIMD FP additions are comparable in cost to a dpps in terms of throughput. (Which is what you're measuring, not latency; the difference matters a lot on CPUs with out-of-order execution, depending on the context of your real use case.)
Performance is not 1-dimensional at the scale of a couple instructions. Slapping a repeat loop around some work can measure the throughput or latency, depending on whether you make the input dependent on the previous output (a loop-carried dependency chain). But if your work ends up bound on front-end throughput, then loop overhead is an important part. Plus you might end up with effects due to how the machine code for your loop lines up with 32-byte boundaries for the uop cache.
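For instance (my sketch, reusing c and d from the question and the escape() helper above; iters is a placeholder count), feeding each result back into the next input measures the latency of the chain rather than throughput:
__m128 acc = _mm_setzero_ps();
for (uint64_t i = 0; i < iters; i++) {
    // each dpps depends on the previous one through acc: a loop-carried dependency chain
    acc = _mm_dp_ps(_mm_add_ps(acc, c), d, 0xFF);
}
escape(&acc);  // keep the final result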
For something this short and simple, static analysis is usually good. Count uops for the front-end and ports in the back end, and analyze latency. See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? LLVM-MCA can do this for you; so can IACA. You can also measure as part of your real loop that uses dot products.
See also RDTSCP in NASM always returns the same value for some discussion of what you can measure about a single instruction.
I have to run the instructions trillions of times for it to take more than a tiny fraction of a second
Current x86 CPUs can loop at best one iteration per clock cycle for a tiny loop. It's impossible to write a loop that runs faster than that. 4 billion iterations (in asm) will take at least a whole second on a 4GHz CPU.
Of course an optimizing C compiler could unroll your loop and be doing as many source iterations as it wants per asm jump.

optimized sum of an array of doubles in C [duplicate]

This question already has answers here:
How to optimize these loops (with compiler optimization disabled)?
(3 answers)
Closed 5 years ago.
I've got an assignment where I must take a program and make it more efficient in terms of time.
The original code is:
#include <stdio.h>
#include <stdlib.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help;
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
int j;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += array[j];
help++;
}
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
}
// You can add some final code between this comment ...
// ... and this one.
return 0;
}
I almost exclusively modified the second for loop by changing it to
double *j=array;
double *p=array+ARRAY_SIZE;
for(; j<p;j+=10){
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
}
This on its own was able to reduce the time down to the criteria.
It already seems to work, but are there any mistakes I'm not seeing?
I posted an improved version of this answer on a duplicate of this: C loop optimization help for final assignment. It was originally just a repost, but then I made some changes to answer the differences in that question. I forget what's different, but you should probably read that one instead. Maybe I should just delete this one.
See also other optimization guides in the x86 tag wiki.
First of all, it's a really crap sample because it doesn't have anything to stop a smart compiler from optimizing away the entire thing. It doesn't even print the sum. Even gcc -O1 (instead of -O3) threw away some of the looping.
Normally you'd put your code in a function, and call it in a loop from main() in another file. And compile them separately, without whole-program cross-file optimisation, so the compiler can't do optimisations based on the compile-time constants you call it with. The repeat-loop being wrapped so tightly around the actual loop over the array is causing havoc with gcc's optimizer (see below).
Also:
gcc -Wall -O3 -march=native fast-loop-cs201.c -o fl
fast-loop-cs201.c: In function ‘main’:
fast-loop-cs201.c:17:14: warning: ‘help’ is used uninitialized in this function [-Wuninitialized]
long int help;
I have to agree with EOF's disparaging remarks about your prof. Giving out code that optimizes away to nothing, and with uninitialized variables, is utter nonsense.
Some people are saying in comments that "the compiler doesn't matter", and that you're supposed to optimize your C source for the CPU microarchitecture yourself, rather than letting the compiler do it. This is crap: for good performance, you have to be aware of what compilers can and can't do. Some optimizations are "brittle", and a small, seemingly-innocent change to the source will stop the compiler from doing something.
I assume your prof mentioned a few things about performance. There are a crapton of different things that could come into play here, many of which I assume didn't get mentioned in a 2nd-year CS class.
Besides multithreading with openmp, there's vectorizing with SIMD. There are also optimizations for modern pipelined CPUs: specifically, avoid having one long dependency chain.
Further essential reading:
Agner Fog's guides for optimizing C and asm for x86. Some of it applies to all CPUs.
What Every Programmer Should Know About Memory
Your compiler manual is also essential, esp. for floating point code. Floating point has limited precision, and is not associative. The final sum does depend on which order you do the additions in. However, usually the difference in rounding error is small. So the compiler can get a big speedup by re-ordering things if you use -ffast-math to allow it. This may have been what your unroll-by-10 allowed.
Instead of just unrolling, keeping multiple accumulators which you only add up at the end can keep the floating point execution units saturated, because FP instructions have latency != throughput. If you need the result of the last op to be complete before the next one can start, you're limited by latency. For FP add, that's one per 3 cycles. In Intel Sandybridge, IvB, Haswell, and Broadwell, the throughput of FP add is one per cycle. So you need to keep at least 3 independent ops that can be in flight at once to saturate the machine. For Skylake, it's 2 per cycle with latency of 4 clocks. (On the plus side for Skylake, FMA is down to 4 cycle latency.)
In this case, there's also basic stuff like pulling things out of the loop, e.g. help += ARRAY_SIZE.
compiler options
I started out with the original inner loop, with just help += ARRAY_SIZE pulled out, and adding a printf at the end so gcc doesn't optimize everything away. Let's try some compiler options and see what we can achieve with gcc 4.9.2 (on my i5 2500k Sandybridge. 3.8GHz max turbo (slight OC), 3.3GHz sustained (irrelevant for this short benchmark)):
gcc -O0 fast-loop-cs201.c -o fl: 16.43s. Performance is a total joke. Variables are stored to memory after every operation, and re-loaded before the next. This is a bottleneck, and adds a lot of latency. Not to mention losing out on actual optimisations. Timing / tuning code with -O0 is not useful.
-O1: 4.87s
-O2: 4.89s
-O3: 2.453s (uses SSE to do 2 at once. I'm of course using a 64bit system, so hardware support for -msse2 is baseline.)
-O3 -ffast-math -funroll-loops: 2.439s
-O3 -march=sandybridge -ffast-math -funroll-loops: 1.275s (uses AVX to do 4 at once.)
-Ofast ...: no gain
-O3 -ftree-parallelize-loops=4 -march=sandybridge -ffast-math -funroll-loops: 0m2.375s real, 0m8.500s user. Looks like locking overhead killed it. It only spawns the 4 threads total, but the inner loop is too short for it to be a win (because it collects the sums every time, instead of giving one thread the first 1/4 of the outer loop iterations).
-Ofast -fprofile-generate -march=sandybridge -ffast-math, run it, then
-Ofast -fprofile-use -march=sandybridge -ffast-math: 1.275s
clang-3.5 -Ofast -march=native -ffast-math: 1.070s. (clang doesn't support -march=sandybridge).
gcc -O3 vectorizes in a hilarious way: The inner loop does 2 (or 4) iterations of the outer loop in parallel, by broadcasting one array element to all elements of an xmm (or ymm) register, and doing an addpd on that. So it sees the same values are being added repeatedly, but even -ffast-math doesn't let gcc just turn it into a multiply. Or switch the loops.
clang-3.5 vectorizes a lot better: it vectorizes the inner loop, instead of the outer, so it doesn't need to broadcast. It even uses 4 vector registers as 4 separate accumulators. However, it doesn't assume that calloc returns aligned memory, and for some reason it thinks the best bet is a pair of 128b loads.
vmovupd -0x60(%rbx,%rcx,8),%xmm4
vinsertf128 $0x1,-0x50(%rbx,%rcx,8),%ymm4,%ymm4
It's actually slower when I tell it that the array is aligned. (with a stupid hack like array = (double*)((ptrdiff_t)array & ~31); which actually generates an instruction to mask off the low 5 bits, because clang-3.5 doesn't support gcc's __builtin_assume_aligned.) I think the way the tight loop of 4x vaddpd mem, %ymmX,%ymmX is aligned puts cmp $0x271c,%rcx crossing a 32B boundary, so it can't macro-fuse with jne. uop throughput shouldn't be an issue, though, since this code is only getting 0.65insns per cycle (and 0.93 uops / cycle), according to perf.
Ahh, I checked with a debugger, and calloc is only returning a 16B-aligned pointer. So half the 32B memory accesses are crossing a cache line, causing a big slowdown. I guess it is slightly faster to do two separate 16B loads when your pointer is 16B-aligned but not 32B-aligned, on Sandybridge. The compiler is making a good choice here.
Source level changes
As we can see from clang beating gcc, multiple accumulators are excellent. The most obvious way to do this would be:
for (j = 0; j < ARRAY_SIZE; j+=4) { // unroll 4 times
sum0 += array[j];
sum1 += array[j+1];
sum2 += array[j+2];
sum3 += array[j+3];
}
and then don't collect the 4 accumulators into one until after the end of the outer loop.
Your source change of
sum += j[0]+j[1]+j[2]+j[3]+j[4]+j[5]+j[6]+j[7]+j[8]+j[9];
actually has a similar effect, thanks to out-of-order execution. Each group of 10 is a separate dependency chain. Order-of-operations rules say the j values get added together first, and then added to sum. So the loop-carried dependency chain is still only the latency of one FP add, and there's lots of independent work for each group of 10. Each group is a separate dependency chain of 9 adds, and takes few enough instructions for the out-of-order execution hardware to see the start of the next chain and find the parallelism to keep those medium-latency, high-throughput FP execution units fed.
With -O0, as your silly assignment apparently requires, values are stored to RAM at the end of every statement. (Technically, at every "sequence point", as the C standards call it.) Writing longer expressions without updating any variables, even temporaries, will make -O0 run faster, but it's not a useful optimisation. Don't waste your time on changes that only help with -O0, esp. not at the expense of readability.
Using 4 accumulators and not adding them together until the end of the outer loop defeats clang's auto-vectorizer. It still runs in only 1.66s (vs. 4.89s for gcc's non-vectorized -O2 with one accumulator). Even gcc -O2 without -ffast-math also gets 1.66s for this source change. Note that ARRAY_SIZE is known to be a multiple of 4, so I didn't include any cleanup code to handle the last up-to-3 elements (or to avoid reading past the end of the array, which would happen as written now). It's really easy to get something wrong and read past the end of the array when doing this.
gcc, on the other hand, does vectorize this, but it also pessimises (un-optimises) the inner loop into a single dependency chain. I think it's doing multiple iterations of the outer loop, again.
Using gcc's platform-independent vector extensions, I wrote a version which compiles into apparently-optimal code:
// compile with gcc -g -Wall -std=gnu11 -Ofast -fno-tree-vectorize -march=native fast-loop-cs201.vec.c -o fl3-vec
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <assert.h>
#include <string.h>
// You are only allowed to make changes to this code as specified by the comments in it.
// The code you submit must have these two values.
#define N_TIMES 600000
#define ARRAY_SIZE 10000
int main(void)
{
double *array = calloc(ARRAY_SIZE, sizeof(double));
double sum = 0;
int i;
// You can add variables between this comment ...
long int help = 0;
typedef double v4df __attribute__ ((vector_size (8*4)));
v4df sum0={0}, sum1={0}, sum2={0}, sum3={0};
const size_t array_bytes = ARRAY_SIZE*sizeof(double);
double *aligned_array = NULL;
// this more-than-declaration could go in an if(i == 0) block for strict compliance with the rules
if ( posix_memalign((void**)&aligned_array, 32, array_bytes) ) {
exit (1);
}
memcpy(aligned_array, array, array_bytes); // In this one case: faster to align once and have no extra overhead for N_TIMES through the loop
// ... and this one.
// Please change 'your name' to your actual name.
printf("CS201 - Asgmt 4 - I. Forgot\n");
for (i = 0; i < N_TIMES; i++) {
// You can change anything between this comment ...
/*
#if defined(__GNUC__) && (__GNUC__ * 100 + __GNUC_MINOR__) >= 407 // GCC 4.7 or later.
array = __builtin_assume_aligned(array, 32);
#else
// force-align for other compilers. This loop-invariant will be done outside the loop.
array = (double*) ((ptrdiff_t)array & ~31);
#endif
*/
assert ( ARRAY_SIZE / (4*4) == (ARRAY_SIZE+15) / (4*4) ); // We don't have a cleanup loop to handle where the array size isn't a multiple of 16
// incrementing pointers can be more efficient than indexing arrays
// esp. on recent Intel where micro-fusion only works with one-register addressing modes
// of course, the compiler can always generate pointer-incrementing asm from array-indexing source
const double *start = aligned_array;
while ( (ptrdiff_t)start & 31 ) {
// annoying loops like this are the reason people use aligned buffers
sum += *start++; // scalar until we reach 32B alignment
// in practice, this loop doesn't run, because we copy into an aligned buffer
// This will also require a cleanup loop, and break our multiple-of-16 doubles assumption.
}
const v4df *end = (v4df *)(aligned_array+ARRAY_SIZE);
for (const v4df *p = (v4df *)start ; p+3 < end; p+=4) {
sum0 += p[0]; // p+=4 increments the pointer by 4 * 4 * 8 bytes
sum1 += p[1]; // make sure you keep track of what you're incrementing
sum2 += p[2];
sum3 += p[3];
}
// the compiler might be smart enough to pull this out of the inner loop
// in fact, gcc turns this into a 64bit movabs outside of both loops :P
help+= ARRAY_SIZE;
// ... and this one. But your inner loop must do the same
// number of additions as this one does.
/* You could argue legalese and say that
if (i == 0) {
for (j ...)
sum += array[j];
sum *= N_TIMES;
}
* still does as many adds in its *INNER LOOP*, but it just doesn't run it as often
*/
}
// You can add some final code between this comment ...
sum0 = (sum0 + sum1) + (sum2 + sum3);
sum += sum0[0] + sum0[1] + sum0[2] + sum0[3];
printf("sum = %g; help=%ld\n", sum, help); // defeat the compiler.
free (aligned_array);
free (array); // not strictly necessary, because this is the end of main(). Leaving it out for this special case is a bad example for a CS class, though.
// ... and this one.
return 0;
}
The inner loop compiles to:
4007c0: c5 e5 58 19 vaddpd (%rcx),%ymm3,%ymm3
4007c4: 48 83 e9 80 sub $0xffffffffffffff80,%rcx # subtract -128, because -128 fits in imm8 instead of requiring an imm32 to encode add $128, %rcx
4007c8: c5 f5 58 49 a0 vaddpd -0x60(%rcx),%ymm1,%ymm1 # one-register addressing mode can micro-fuse
4007cd: c5 ed 58 51 c0 vaddpd -0x40(%rcx),%ymm2,%ymm2
4007d2: c5 fd 58 41 e0 vaddpd -0x20(%rcx),%ymm0,%ymm0
4007d7: 4c 39 c1 cmp %r8,%rcx # compare with end with p
4007da: 75 e4 jne 4007c0 <main+0xb0>
(For more, see online compiler output at godbolt. Note I had to cast the return value of calloc, because godbolt uses C++ compilers, not C compilers. The inner loop is from .L3 to jne .L3. See https://stackoverflow.com/tags/x86/info for x86 asm links. See also Micro fusion and addressing modes, because this Sandybridge change hasn't made it into Agner Fog's manuals yet.).
performance:
$ perf stat -e task-clock,cycles,instructions,r1b1,r10e,stalled-cycles-frontend,stalled-cycles-backend,L1-dcache-load-misses,cache-misses ./fl3-vec
CS201 - Asgmt 4 - I. Forgot
sum = 0; help=6000000000
Performance counter stats for './fl3-vec':
1086.571078 task-clock (msec) # 1.000 CPUs utilized
4,072,679,849 cycles # 3.748 GHz
2,629,419,883 instructions # 0.65 insns per cycle
# 1.27 stalled cycles per insn
4,028,715,968 r1b1 # 3707.733 M/sec # unfused uops
2,257,875,023 r10e # 2077.982 M/sec # fused uops. lower than insns because of macro-fusion
3,328,275,626 stalled-cycles-frontend # 81.72% frontend cycles idle
1,648,011,059 stalled-cycles-backend # 40.47% backend cycles idle
751,736,741 L1-dcache-load-misses # 691.843 M/sec
18,772 cache-misses # 0.017 M/sec
1.086925466 seconds time elapsed
I still don't know why it's getting such low instructions per cycle. The inner loop is using 4 separate accumulators, and I checked with gdb that the pointers are aligned. So cache-bank conflicts shouldn't be the problem. Sandybridge L2 cache can sustain one 32B transfer per cycle, which should keep up with the one 32B FP vector add per cycle.
32B loads from L1 take 2 cycles (it wasn't until Haswell that Intel made 32B loads a single-cycle operation). However, there are 2 load ports, so the sustained throughput is 32B per cycle (which we're not reaching).
Perhaps the loads need to be pipelined ahead of when they're used, to minimize having the ROB (re-order buffer) fill up when a load stalls? But the perf counters indicate a fairly high L1 cache hit rate, so hardware prefetch from L2 to L1 seems to be doing its job.
0.65 instructions per cycle is only about half way to saturating the vector FP adder. This is frustrating. Even IACA says the loop should run in 4 cycles per iteration. (i.e. saturate the load ports and port1 (where the FP adder lives)) :/
update: I guess L2 latency was the problem after all. Reducing ARRAY_SIZE to 1008 (multiple of 16), and increasing N_TIMES by a factor of 10, brought the runtime down to 0.5s. That's 1.68 insns per cycle. (The inner loop is 7 total instructions for 4 FP adds, thus we are finally saturating the vector FP add unit, and the load ports.) IDK why the HW prefetcher can't get ahead after one stall, and then stay ahead. Possibly software prefetch could help? Maybe somehow avoid having the HW prefetcher run past the array, and instead start prefetching the start of the array again. (loop tiling is a much better solution, see below.)
Intel CPUs only have 32k each L1-data and L1-instruction caches. I think your array would just barely fit in the L1 on an AMD CPU.
Gcc's attempt to vectorize by broadcasting the same value into a parallel add doesn't seem so crazy. If it had managed to get this right (using multiple accumulators to hide latency), that would have allowed it to saturate the vector FP adder with only half the memory bandwidth. As-is, it was pretty much a wash, probably because of overhead in broadcasting.
Also, it's pretty silly. The N_TIMES is a just a make-work repeat. We don't actually want to optimize for doing the identical work multiple times. Unless we want to win at silly assignments like this. A source-level way to do this would be to increment i in the part of the code we're allowed to modify:
for (...) {
sum += a[j] + a[j] + a[j] + a[j];
}
i += 3; // The inner loop does 4 total iterations of the outer loop
More realistically, to deal with this you could interchange your loops (loop over the array once, adding each value N_TIMES times). I think I've read that Intel's compiler will sometimes do that for you.
A more general technique is called cache blocking, or loop tiling. The idea is to work on your input data in small blocks that fit in cache. Depending on your algorithm, it can be possible to do various stages of thing on a chunk, then repeat for the next chunk, instead of having each stage loop over the whole input. As always, once you know the right name for a trick (and that it exists at all), you can google up a ton of info.
You could rules-lawyer your way into putting an interchanged loop inside an if (i == 0) block in the part of the code you're allowed to modify. It would still do the same number of additions, but in a more cache-optimal order.
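A sketch of what that could look like (my own illustration, not part of the original answer):
for (i = 0; i < N_TIMES; i++) {
    // You can change anything between this comment ...
    if (i == 0) {
        // Same total number of additions, but each array[j] stays in a register
        // while it is added N_TIMES times, instead of re-streaming the whole 80KB array.
        for (int j = 0; j < ARRAY_SIZE; j++)
            for (int t = 0; t < N_TIMES; t++)
                sum += array[j];
    }
    // ... and this one. But your inner loop must do the same
    // number of additions as this one does.
}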
I would try this for the inner loop:
double* tmp = array;
for (j = 0; j < ARRAY_SIZE; j++) {
sum += *tmp; // Use a pointer
tmp++; // because it is faster to increment the pointer
// than it is to recalculate array+j every time
help++;
}
or better
double* tmp = array;
double* end = array + ARRAY_SIZE; // Get rid of variable j by calculating
// the end criteria and
while (tmp != end) { // just compare if the end is reached
sum += *tmp;
tmp++;
help++;
}
I think you should read about the OpenMP library if you can use multithreading. But this is such a simple example that I don't think it can be optimized much.
One thing is certain: you don't need to declare i and j before the for loop. This would do:
for (int i = 0; i < N_TIMES; i++)

why does GCC __builtin_prefetch not improve performance?

I'm writing a program to analyze a social network graph, which means the program needs a lot of random memory accesses. It seems to me prefetching should help. Here is a small piece of the code that reads values from the neighbors of a vertex.
for (size_t i = 0; i < v.get_num_edges(); i++) {
unsigned int id = v.neighbors[i];
res += neigh_vals[id];
}
I transform the code above to the one as below and prefetch the values of the neighbors of a vertex.
int *neigh_vals = new int[num_vertices];
for (size_t i = 0; i < v.get_num_edges(); i += 128) {
size_t this_end = std::min(v.get_num_edges(), i + 128);
for (size_t j = i; j < this_end; j++) {
unsigned int id = v.neighbors[j];
__builtin_prefetch(&neigh_vals[id], 0, 2);
}
for (size_t j = i; j < this_end; j++) {
unsigned int id = v.neighbors[j];
res += neigh_vals[id];
}
}
In this C++ code, I didn't overload any operators.
Unfortunately, the code doesn't really improve the performance. I wonder why. Apparently, hardware prefetching doesn't work in this case because the hardware can't predict the memory locations.
I wonder if it's caused by GCC optimization. When I compile the code, I enable -O3. I really hope prefetch can further improve performance even when -O3 is enabled. Does -O3 optimization fuse the two loops in this case? Can -O3 enable prefetch in this case by default?
I use gcc version 4.6.3 and the program runs on Intel Xeon E5-4620.
Thanks,
Da
Yes, some recent versions of GCC (e.g. 4.9 in March 2015) are able to issue some PREFETCH instructions when optimizing with -O3 (even without any explicit __builtin_prefetch).
We don't know what get_neighbor is doing, or what the types of v and neigh_val are.
And prefetching is not always profitable. Adding explicit __builtin_prefetch can slow down your code. You need to measure.
As Retired Ninja commented, prefetching in one loop and hoping data would be cached in the following loop (further down in your source code) is wrong.
You might perhaps try instead
for (size_t i = 0; i < v.get_num_edges(); i++) {
fg::vertex_id_t id = v.get_neighbor(i);
__builtin_prefetch (&neigh_vals[v.get_neighbor(i+4)]);
res += neigh_vals[id];
}
You could empirically replace the 4 with whatever appropriate constant is the best.
But I guess that the __builtin_prefetch above is useless (since the compiler is probably able to add it by itself), and it could harm, or even crash the program if computing its argument gives undefined behavior, e.g. if v.get_neighbor(i+4) is undefined; prefetching an address outside of your address space won't harm by itself, but it could slow down your program. Please benchmark.
See this answer to a related question.
Notice that in C++ all of [], get_neighbor could be overloaded and becomes very complex operations, so we cannot guess!
And there are cases where the hardware is limiting performance, whatever __builtin_prefetch you add (and adding them could hurt performance)
BTW, you might pass -O3 -mtune=native -fdump-tree-ssa -S -fverbose-asm to understand more what the compiler is doing (and look inside generated dump files and assembler files); also, it does happen that -O3 produces slightly slower code than what -O2 gives.
You could consider explicit multithreading, OpenMP, OpenCL if you have time to waste on optimization. Remember that premature optimization is evil. Did you benchmark, did you profile your entire application?

How to give hint to gcc about loop count

Knowing the number of iterations a loop will go through allows the compiler to do some optimization. Consider for instance the two loops below:
Unknown iteration count :
static void bitreverse(vbuf_desc * vbuf)
{
unsigned int idx = 0;
unsigned char * img = vbuf->usrptr;
while(idx < vbuf->bytesused) {
img[idx] = bitrev[img[idx]];
idx++;
}
}
Known iteration count
static void bitreverse(vbuf_desc * vbuf)
{
unsigned int idx = 0;
unsigned char * img = vbuf->usrptr;
while(idx < 1280*400) {
img[idx] = bitrev[img[idx]];
idx++;
}
}
The second version will compile to faster code, because it will be unrolled twice (on ARM with gcc 4.6.3 and -O2, at least). Is there a way to make an assertion about the loop count that gcc will take into account when optimizing?
There is a hot attribute on functions to give the compiler a hint about hot spots: http://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html. Just add this before your function:
static void bitreverse(vbuf_desc * vbuf) __attribute__ ((hot));
Here the docs about 'hot' from gcc:
hot The hot attribute on a function is used to inform the compiler
that the function is a hot spot of the compiled program. The function
is optimized more aggressively and on many targets it is placed into a
special subsection of the text section so all hot functions appear
close together, improving locality. When profile feedback is available,
via -fprofile-use, hot functions are automatically detected and this
attribute is ignored.
The hot attribute on functions is not implemented in GCC versions
earlier than 4.3.
The hot attribute on a label is used to inform the compiler that the path
following the label is more likely than paths that are not so
annotated. This attribute is used in cases where __builtin_expect
cannot be used, for instance with computed goto or asm goto.
The hot attribute on labels is not implemented in GCC versions earlier
than 4.8.
Also you can try adding __builtin_expect around your idx < vbuf->bytesused condition - it will be a hint that in most cases the expression is true.
In both cases I'm not sure that your loop will be optimized.
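For example (my sketch of the usual __builtin_expect pattern applied to this loop):
while (__builtin_expect(idx < vbuf->bytesused, 1)) {  // "this condition is usually true"
    img[idx] = bitrev[img[idx]];
    idx++;
}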
Alternatively you can try profile-guided optimization. Build profile-generating version of program with -fprofile-generate; run it on target, copy profile data to build-host and rebuild with -fprofile-use. This will give a lot of information to compiler.
In some compilers (not in GCC) there are loop pragmas, including "#pragma loop count (N)" and "#pragma unroll (M)", e.g. in Intel; unroll in IBM; vectorizing pragmas in MSVC
The ARM compiler (armcc) also has some loop pragmas, including unroll(n):
Loop Unrolling: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJACACFE.html and http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0348b/CJAHJDAB.html
and __promise intrinsic:
Using __promise to improve vectorization
The __promise(expr) intrinsic is a promise to the compiler that a given expression is nonzero. This enables the compiler to improve vectorization by optimizing away code that, based on the promise you have made, is redundant.
The disassembled output of Example 3.21 shows the difference that __promise makes, reducing the disassembly to a simple vectorized loop by the removal of a scalar fix-up loop.
Example 3.21. Using __promise(expr) to improve vectorization code
void f(int *x, int n)
{
int i;
__promise((n > 0) && ((n&7)==0));
for (i=0; i<n;i++) x[i]++;
}
You can actually specify the exact count with __builtin_expect, like this:
while (idx < __builtin_expect(vbuf->bytesused, 1280*400)) {
This tells gcc that vbuf->bytesused is expected to be 1280*400 at runtime.
Alas, this does nothing for optimization with current gcc versions. I haven't tried with 4.8, though.
Edit: Just realized that every standard C compiler has a way to exactly specify the loop count, via assert. Since the assert
#include <assert.h>
...
assert(loop_count == 4096);
for (i = 0; i < loop_count; i++) ...
will call exit() or abort() if the condition is not true, any compiler with value propagation will know the exact value of loop_count. I always thought that this would be the most elegant and standard-conforming way to give such optimization hints. Now, I want a C compiler that actually uses this information.
Note that if you want to make this faster, bytewise unrolling might be less effective than using a wider lookup table. A 16-bit table would occupy 128K, and thus often fit into the CPU cache. If the data is not completely random, an even wider table (3 bytes) might be effective.
2-byte example:
unsigned short *bitrev2;
...
for (idx = 0; idx < vbuf->bytesused; idx += 2) {
*(unsigned short *)(&img[idx]) = bitrev2[*(unsigned short *)(&img[idx])];
}
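The 2-byte table itself can be built once at startup from the existing byte table, e.g. (my sketch, assuming bitrev is the 256-entry byte table the original loop implies):
bitrev2 = malloc(65536 * sizeof *bitrev2);   // needs <stdlib.h>
for (unsigned v = 0; v < 65536; v++) {
    // reverse each byte of the 16-bit index independently, keeping byte positions
    bitrev2[v] = (unsigned short)(bitrev[v & 0xFF] | (bitrev[v >> 8] << 8));
}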
This is an optimization the compiler can't perform, regardless of the information you give it.
