Where do SSE instructions outperform normal instructions? (C)

Where do x86-64's SSE instructions (vector instructions) outperform normal instructions? What I'm seeing is that the frequent loads and stores required for executing SSE instructions nullify any gain we get from the vector calculation. So could someone give me an example of SSE code that performs better than normal code?
Maybe it's because I am passing each element separately, like this...
__m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
__m128i b = _mm_set_epi32(pb[0], pb[1], pb[2], pb[3]);
__m128i res = _mm_add_epi32(a, b);
for( i = 0; i < 4; i++ )
po[i] = res.m128i_i32[i];
Isn't there a way I can pass all 4 integers in one go, I mean load the whole 128 bits of pa at once? And assign res.m128i_i32 to po at once?

Summarizing comments into an answer:
You have basically fallen into the same trap that catches most first-timers. There are two problems in your example:
You are misusing _mm_set_epi32().
You have a very low computation/load-store ratio. (1 to 3 in your example)
_mm_set_epi32() is a very expensive intrinsic. Although it's convenient to use, it doesn't compile to a single instruction. Some compilers (such as VS2010) can generate very poor performing code when using _mm_set_epi32().
Instead, since you are loading contiguous blocks of memory, you should use _mm_load_si128(). That requires the pointer to be aligned to 16 bytes. If you can't guarantee this alignment, you can use _mm_loadu_si128() - but with a performance penalty. Ideally, you should properly align your data so that you don't need to resort to using _mm_loadu_si128().
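For example, a minimal sketch of the same 4-int addition using whole-register loads and stores (the function name add4 is just for illustration; use the aligned _mm_load_si128 / _mm_store_si128 variants if you can guarantee 16-byte alignment):

#include <emmintrin.h>

void add4(const int *pa, const int *pb, int *po)
{
    __m128i a   = _mm_loadu_si128((const __m128i*)pa);   // one 128-bit load instead of 4 inserts
    __m128i b   = _mm_loadu_si128((const __m128i*)pb);
    __m128i res = _mm_add_epi32(a, b);
    _mm_storeu_si128((__m128i*)po, res);                  // one 128-bit store instead of 4 scalar stores
}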
To be truly efficient with SSE, you'll also want to maximize your computation/load-store ratio. A target that I shoot for is 3 - 4 arithmetic instructions per memory access. This is a fairly high ratio. Typically you have to refactor the code or redesign the algorithm to increase it. Combining passes over the data is a common approach.
Loop unrolling is often necessary to maximize performance when you have large loop bodies with long dependency chains.
Here are some examples of SO questions that successfully use SSE to achieve a speedup:
C code loop performance (non-vectorized)
C code loop performance [continued] (vectorized)
How do I achieve the theoretical maximum of 4 FLOPs per cycle? (contrived example for achieving peak processor performance)

Related

find nan in array of doubles using simd

This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN is the only value for which a != a is true). Something like:
double *data = ...;
__m256d a, b;
int temp = 0;
//This bit would be in a loop over the array
//I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well, if something like _CMP_EQ_UQ will detect NaNs, I can just use that, and then I can compare 4 doubles to 4 doubles and magically look at 8 doubles at once.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?
You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because this check has very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. What does ordered / unordered comparison mean?
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
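For instance, a minimal sketch of that sort-out step (assuming GCC/Clang's __builtin_ctz for the bit-scan), once a block is known to contain a NaN:

__m256d v = _mm256_loadu_pd(data);                 // the block that reported unordered
__m256d self = _mm256_cmp_pd(v, v, _CMP_UNORD_Q);  // lanes where v is NaN
int mask = _mm256_movemask_pd(self);
if (mask) {
    int lane = __builtin_ctz(mask);   // index (0..3) of the first NaN element in this vector
}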
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x _mm256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
__m256d a = _mm256_loadu_pd(p+0);
__m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
__m256d c = _mm256_loadu_pd(p+8);
__m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
__m256d abcdnan = _mm256_or_pd(abnan, cdnan);
return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
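A C version of that loop, using the overlapping-cleanup idea from above, might look like this (a sketch; the function name is illustrative and it assumes len >= 16):

static int array_has_nan(double *p, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16)
        if (any_nan_block(p + i))
            return 1;
    if (i < len)          // cleanup: re-check the last 16 doubles, overlapping already-checked elements
        return any_nan_block(p + len - 16) != 0;
    return 0;
}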
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fetestexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
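A rough sketch of that fenv check (standard C <fenv.h>; strictly speaking #pragma STDC FENV_ACCESS ON is required for well-defined flag testing):

#include <fenv.h>

// Call feclearexcept(FE_INVALID) before the computation that fills the array;
// afterwards, only scan the array if the sticky "invalid" flag got set.
static int invalid_raised_since_clear(void)
{
    return fetestexcept(FE_INVALID) != 0;
}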
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.

What is 'differential timing' technique for doing benchmarks?

At around 39 minute of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube)
there is a slide of how to use differential timing... can someone show me some basic code with this approach? It was only mentioned for a second, but I think that's an interesting idea.
Run the baseline 2n times (t_2a)
vs. run the baseline n times + the contender n times (t_a+b).
Relative improvement = t_2a / (2*t_a+b - t_2a)
"some overhead noise is cancelled"
Alexandrescu's slide is fairly straightforward to turn into code:
#include <chrono>
using clock = std::chrono::steady_clock;

auto start = clock::now();
for( int i = 0; i < 2*n; i++ )
    baseline();
auto t2a = clock::now() - start;

start = clock::now();
for( int i = 0; i < n; i++ )
    baseline();
// *
for( int i = 0; i < n; i++ )
    contender();
auto taplusb = clock::now() - start;

// convert to floating-point durations so the division isn't truncated
std::chrono::duration<double> d2a = t2a, daplusb = taplusb;
double r = d2a / (2 * daplusb - d2a); // relative speedup
* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speedup this way, as opposed to simply tBaseline / tContender as I've been doing forever. He only vaguely hints at '...overhead noise (being) cancelled (out)', but doesn't explain it in detail.
If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing that 2n baseline vs. n baseline + n contender in separate invocations of the benchmark program.
So instead of doing some warmup runs before the timed region, he's using the baseline as a controlled warmup inside the timed region. This might make it possible to just time the whole program, e.g. perf stat, instead of calling a time function inside the program. Depending on how much process startup overhead your OS has vs. how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
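A common sketch of such "escape"/"clobber" helpers (GCC/Clang inline asm; this particular form comes from Chandler Carruth's benchmarking talks, not from the video under discussion):

static void escape(void *p) {
    asm volatile("" : : "g"(p) : "memory");   // compiler must assume *p is read/written
}
static void clobber() {
    asm volatile("" : : : "memory");          // compiler must assume all memory is read/written
}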
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as the timed region will warm up branch prediction for it. Or maybe not, because after inlining, the warm-up and main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor, but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
Take the rest of his advice with a grain of salt, too. Most of it is useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays where cache footprint matters), but the reasoning is wrong for some of it.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit ints in a vector for paddd, or 2x 64-bit ints in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64, r64. I don't think that was ever true, even on Pentium 4 (Nocona, the first P4 with 64-bit support), despite P4 having funky double-pumped ALUs with 0.5-cycle latency for add.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. GCC/Clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements when processing an array of int8_t or uint8_t. I forget if I've ever actually worked around that by using narrow locals, though. C's default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That has never been true on x86-64: FP math uses SSE/SSE2, which has an instruction that does a truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE2 includes cvttsd2si exactly for that common case.
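For example, a plain C cast already expresses that truncating conversion; on x86-64 it typically compiles to a single cvttsd2si at -O2 (the function name here is just for illustration):

static int truncate_to_int(double d) {
    return (int)d;   // typically one cvttsd2si instruction, no rounding-mode dance
}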
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively doing (2 * baseline) / (2*contender) = baseline/contender.
It assumes that the times add normally (no overlapping computation): t_2a = 2 * baseline, and 2 * t_a+b = 2 * baseline + 2 * contender. Subtracting cancels the 2*baseline parts, leaving you with 2*contender.
The trick isn't in the math; if anything this is more mathematically dangerous, because subtracting two large, nearly equal numbers amplifies error. i.e. if the n*baseline part actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.
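As a quick sanity check with made-up numbers: suppose baseline() really takes 10 ns per call and contender() takes 5 ns, with n = 1000. Then t2a = 2000 * 10 ns = 20000 ns and taplusb = 1000 * 10 ns + 1000 * 5 ns = 15000 ns, so 2*taplusb - t2a = 10000 ns and r = 20000 / 10000 = 2.0, which is exactly baseline/contender.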

Do intrinsics on Intel Xeon Phi get better performance than auto-vectorization?

Intel Xeon Phi provides the "IMCI" instruction set.
I used it to do "c = a*b", like this:
float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float z[N];
_Cilk_for(size_t i = 0; i < N; i+=16)
{
__m512 x_1Vec = _mm512_load_ps(x+i);
__m512 y_1Vec = _mm512_load_ps(y+i);
__m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
_mm512_store_pd(z+i,ans);
}
And I tested its performance. When N is 1048576, it costs 0.083317 sec. I want to compare the performance with auto-vectorization,
so the other version of the code looks like this:
_Cilk_for(size_t i = 0; i < N; i++)
z[i] = x[i] * y[i];
This version costs 0.025475 sec (but sometimes 0.002285 or less; I don't know why).
If I change the _Cilk_for to #pragma omp parallel for, the performance will be poor.
So, if results are like this, why do we need to use intrinsics?
Did I make any mistakes anywhere?
Can someone give me some good suggestion to optimize the code?
The measurements don't mean much, because of various mistakes.
The code is storing 16 floats as 8 doubles. The _mm512_store_pd should be _mm512_store_ps.
The code is using _mm512_store_... on an unaligned location with address z+i, which may cause a segmentation fault. Use __declspec(align(64)) to fix this.
The arrays x and y are not initialized. That risks introducing random numbers of denormal values, which might impact performance. (I'm not sure if this is an issue for Intel Xeon Phi).
There's no evidence that z is used, hence the optimizer might remove the calculation. I think it is not the case here, but it's a risk with trivial benchmarks like this.
Also, allocating a large array on the stack risks stack overflow.
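For reference, a sketch of the intrinsics version with those issues addressed (one option: aligned heap allocation for z, initialized inputs, and the float-width store; the fill values are arbitrary):

float* x = (float*) _mm_malloc(N*sizeof(float), 64);
float* y = (float*) _mm_malloc(N*sizeof(float), 64);
float* z = (float*) _mm_malloc(N*sizeof(float), 64);   // heap + 64-byte aligned: no stack overflow, aligned stores
for(size_t i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }  // initialized: no accidental denormals

_Cilk_for(size_t i = 0; i < N; i+=16)
{
    __m512 x_1Vec = _mm512_load_ps(x+i);
    __m512 y_1Vec = _mm512_load_ps(y+i);
    __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
    _mm512_store_ps(z+i, ans);   // _ps, not _pd: stores 16 floats, matching the vector width
}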
A single run of the examples is probably a poor benchmark, because the time is probably dominated by fork/join overheads of the _Cilk_for. Assuming 120 Cilk workers (the default for 60 4-way threaded cores), there is only about 1048576/120/16 = ~546 iterations per worker. With a clock rate over 1 GHz, that won't take long. In fact, the work in the loop is so small that most likely some workers never get a chance to steal work. That might account for why the _Cilk_for outruns OpenMP. In OpenMP, all the threads must take part in a fork/join for a parallel region to finish.
If the test were written to correct all the mistakes, it would essentially be computing z[:] = x[:]*y[:] on a large array. Because of the wide vector units on Intel(R) Xeon Phi(TM), this becomes a test of memory/cache bandwidth, not ALU speed, since the ALU is quite capable of outrunning memory bandwidth.
Intrinsics are useful for things that can't be expressed as parallel/simd loops, typically stuff needing fancy permutations. For example, I've used intrinsics to do a 16-element prefix-sum operation on MIC (only 6 instructions if I remember correctly).
My answer below equally applies to Intel Xeon and Intel Xeon Phi.
An intrinsics-based solution is the most "powerful", just "like" assembly coding is.
But on the negative side, an intrinsics-based solution is usually not (the most) portable, not a "productivity"-oriented approach, and is often not applicable to established "legacy" software codebases.
Plus, it often requires the programmer to be a low-level or even micro-architecture expert.
However, there are alternatives to intrinsics/assembly coding. They are:
A) Auto-vectorization (the compiler recognizes some patterns and automatically generates vector code)
B) "Explicit" or user-guided vectorization (the programmer gives the compiler some guidance about what to vectorize and under which conditions, etc.; explicit vectorization usually implies using keywords or pragmas)
C) Using vector classes or other kinds of intrinsics-wrapper libraries, or even very specialized compilers. In fact, option C is often as bad as intrinsics coding in terms of productivity and incremental updates to legacy code.
In your second code snippet you seem to use "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks", supported by all recent versions of the Intel Compiler and also by GCC 4.9.
(I say that you seem to use explicit vectorization because Cilk_for was originally invented for the purpose of multi-threading; however, the most recent versions of the Intel Compiler may automatically parallelize and vectorize the loop when cilk_for is used.)

What is the limit of optimization using SIMD?

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
Thanks
The best optimization comes from rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.
For example, if the vector array is a list of n points that all lie on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
It CAN give bigger speedups than 4x over straight floating point, as the SIMD instructions could be less exact (not so much as to give too many problems, though) and so take fewer cycles to execute. It really depends.
The best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than a 4x improvement. You may find out it can't. We can't say, though, without knowing more about the algorithm you are optimising and what CPU you are targeting.
On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...
This is entirely possible.
You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect.
Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.
Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the degree of parallelism (16 pixels at a time).
It depends on the architecture. For the moment I'll assume x86 architecture (aka SSE).
You can easily get a factor of four on tight loops. Just replace your existing math with SSE instructions and you're done.
You can even get a little more than that, because with SSE you do the math in registers that are usually not used by the compiler. This frees up the general-purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.
And then there is the option of hinting the memory controller about how you want to access memory, e.g. whether you want to store data in a way that bypasses the cache or not. For bandwidth-hungry algorithms that may give you some extra speed on top of that.
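On x86 with SSE, for example, that cache-bypassing hint is a non-temporal store. A minimal sketch (the function and its parameters are illustrative; assumes dst is 16-byte aligned and n is a multiple of 4):

#include <stddef.h>
#include <xmmintrin.h>

void scale_copy_nt(float *dst, const float *src, size_t n, float k)
{
    __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < n; i += 4)
        _mm_stream_ps(dst + i, _mm_mul_ps(_mm_loadu_ps(src + i), vk));   // store bypasses the cache
    _mm_sfence();   // make the streaming stores visible before anyone reads dst
}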

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier caused by a lack of understanding on my part. There's a bit more to it, but the simplest question I can come up with is as follows:
If I have the following code:
float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltx, delty, deltz;
int i, j;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for( i = 0; i < 32768-1; i++ ) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for( j = 0; j < 32768-1; j++ ) {
        deltx = xref - px[j];
        delty = yref - py[j];
        deltz = zref - pz[j];
    }
}
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPUs can do at least one scalar floating point instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separates it afterwards, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
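A sketch of that unrolled approach with SSE intrinsics (assuming the px/py/pz arrays are 16-byte aligned): process four j elements per iteration against broadcast reference values.

__m128 vxref = _mm_set1_ps(xref);
__m128 vyref = _mm_set1_ps(yref);
__m128 vzref = _mm_set1_ps(zref);
for( j = 0; j < 32768; j += 4 ) {
    __m128 dx = _mm_sub_ps(vxref, _mm_load_ps(&px[j]));   // 4 deltas per subtract
    __m128 dy = _mm_sub_ps(vyref, _mm_load_ps(&py[j]));
    __m128 dz = _mm_sub_ps(vzref, _mm_load_ps(&pz[j]));
    // ... use dx, dy, dz here ...
}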
Consider: How wide is a float? How wide is the SSEx instruction? The ratio should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havoc with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.

Resources