how will unrolling affect the cycles per element count CPE - c

how do I calculate CPE (cycles per element) with these code snippets?
what is the difference in the CPE between the 2 given code snippets?
I have this piece of code
void randomFunction(float a[], float Tb[], float c[], long int n){
    int i, j, k;
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n-1; k++) {
                temp += a[i*n+k] * Tb[j*n+k];
            }
}
This is the assembly for the inner-most loop, from GCC10.3 -O2 (https://godbolt.org/z/cWE16rW1r), for a version of the function with a local float temp=0; that it returns so the loop won't be optimized away:
.L4:
movss (%rcx,%rax), %xmm0
mulss (%rdx,%rax), %xmm0
addq $4, %rax
addss %xmm0, %xmm1
cmpq %rax, %rsi
jne .L4
Now I am trying to 'optimise it' using unrolling.
void randomUnrollingFunction(float a[], float Tb[], float c[], long int n){
    int i, j, k;
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n-1; k += 2) {   // this is the unrolled portion, by 2
                temp += a[i*n+k]   * Tb[j*n+k];
                temp += a[i*n+k+1] * Tb[j*n+k+1];
            }
}
I am wondering what the estimated CPE will be with this unrolling by a factor of 2.
CPE is the number of cycles divided by the number of elements processed.
This is the information for the latency:
Thank you for any help in advance!

Your loop entirely bottlenecks on addss latency (float add), at 3 cycles per element. Out-of-order exec lets the other work run in the "shadow" of this, assuming your Haswell-like CPU has the memory bandwidth to keep up (footnote 1).
This way of unrolling doesn't help at all: it doesn't change the serial dependency chain, unless you compiled with -ffast-math or something to let a compiler turn it into
temp1 += a[...+0]* Tb[...+0];
temp2 += a[...+1]* Tb[...+1];
Or add pairs before feeding them into that serial dependency, like
temp += a[]*Tb[] + a[+1]*Tb[+1];
One long serial dependency chain is the worst, and also numerically not great: pairwise summation (or especially just a step in that direction using multiple accumulators) would be numerically better and also perform better. (Simd matmul program gives different numerical results).
(If your multiple accumulators are 4 elements of one SIMD vector, you can do 4x the amount of work with the same pattern of asm instructions. But you then need to unroll with multiple vectors, because addps has the same performance characteristics as addss on modern x86 CPUs.)
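For example, here is a hedged SSE sketch (not from the question) of one row dot-product with two 4-wide vector accumulators, giving two independent addps chains; this reassociates the FP sum, which is exactly what -ffast-math would permit. The function name and the assumption that len is a multiple of 8 are illustrative.
#include <immintrin.h>
// Hedged sketch: dot product of two rows with two independent vector
// accumulators. Assumes len is a multiple of 8; reassociates the FP sum.
static float dot_rows_2acc(const float *arow, const float *brow, long len) {
    __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
    for (long k = 0; k < len; k += 8) {
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(arow + k),
                                           _mm_loadu_ps(brow + k)));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(arow + k + 4),
                                           _mm_loadu_ps(brow + k + 4)));
    }
    __m128 acc = _mm_add_ps(acc0, acc1);
    // horizontal sum of the 4 lanes
    acc = _mm_add_ps(acc, _mm_movehl_ps(acc, acc));
    acc = _mm_add_ss(acc, _mm_shuffle_ps(acc, acc, 1));
    return _mm_cvtss_f32(acc);
}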
Footnote 1: Two sequential read streams of 4 bytes each per 3 cycles; certainly a desktop Haswell could keep up, and probably even a Xeon competing with many other cores for memory bandwidth. But the reads from a[] probably hit in cache, because a[i*n+k] is the same row repeatedly until we move on to the next outer-loop iteration. So only 1 row of a[] has to stay hot in cache (to get hits next middle iteration) while we scan through a row of Tb. So a[] has to come in from DRAM once total, if n isn't huge, but we loop over the whole Tb[] in order n times.
More detailed version
See What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? - look for the dependency chains (in this case the addss into %xmm1). Also A whirlwind introduction to dataflow graphs.
Then look for throughput bottlenecks in the back-end and front-end. In your case, latency dominates. (Assuming the front-end matches this Haswell back-end, although it wouldn't take much to keep up with this back-end latency bottleneck. Also, I hate that they number their "functional units" from 1, instead of following Intel's numbering of ports 0,1,5,6 having ALUs. Haswell's FP adder is on port 1, and ports 2/3 are load/store-AGU units, etc.)
ADDSS has 3 cycle latency, so temp += ... can only execute once per 3 cycles. The load / MULSS operations just independently prepare inputs for ADDSS, and the loop overhead is also independent.
Note that if not for the latency bottleneck, your loop would bottleneck on the front-end on a real Haswell (4 fused-domain uops per cycle), not on back-end functional units. The loop is 5 fused-domain uops, assuming macro-fusion of the cmp/jne, and that Haswell can keep the memory-source addss micro-fused despite the indexed addressing mode. (Sandybridge would un-laminate it.)
In the general case, knowing the back-end functional units is not sufficient. The front-end is also a common bottleneck, especially in loops that have to store something, like an actual matmul.
But thanks to the serial dependency on ADDSS (which actually carries across outer loops), the only thing that matters is that dependency chain.
Even a branch mispredict on the last iteration of the inner loop (when the branch is not-taken instead of normally taken) will just give the back-end more time to chew through those pending ADDSS operations while the front-end sorts itself out and starts in on the next inner loop.
Since you unrolled in a way that doesn't change the serial dependency, it makes zero difference to performance except for tiny n. (For tiny n, the whole thing can potentially overlap some of this work with independent work in the caller before/after the call. In that case, saving instructions could be helpful, also allowing out-of-order exec to "see farther". Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths is a case where OoO exec can (partially) overlap two independent imul dep chains that in program-order are one after the other.
Of course at that point, you're considering code outside what you're showing. And even for n=10, that's 10^3 = 1000 inner iterations, and Haswell's ROB is only 192 uops large, with an RS of 60 entries: https://www.realworldtech.com/haswell-cpu/3/.)
Unrolling in a useful way
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) re: unrolling in ways that do create more instruction-level parallelism, hiding FP dependency chains.
Unrolling differently, only summing once into temp per loop iteration, will keep the same cycles per iteration while still doubling the number of elements you process.
for (k = 0; k < n-1; k += 2) {   // this is the unrolled portion, by 2
    temp += (a[i*n+k]   * Tb[j*n+k] +
             a[i*n+k+1] * Tb[j*n+k+1]);
}
Obviously you can keep doing this until you run into front-end or back-end throughput limits, like one add per clock. The above version does two adds per 3 clocks.
Your "functional unit" table doesn't list FMA (fused multiply-add), but real Haswell has it, with identical performance to FP mul. Iit wouldn't help much if any because your current loop structure does 2 loads per mul+add, so reducing that to 2 loads and one FMA would still bottleneck on loads. Might help with front-end throughput?
What might help reduce loads is unrolling over one of the outer loops, using both a[i*n+k] and a[(i+1)*n+k] with one Tb[j*n+k]. That of course changes the order of the calculation, so isn't legal for a compiler without -ffast-math because FP math isn't strictly associative.
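A hedged sketch of that outer-loop unrolling (signature trimmed to what's used, assuming n is even; the two accumulators are the reassociation -ffast-math would allow):
// Hedged sketch, not the original function: unroll the outer i loop by 2 so
// each Tb[] element loaded is shared by two rows of a[], halving loads per
// FP multiply. Keeps the question's k < n-1 bound.
float sum_products_unroll_i(const float a[], const float Tb[], long n) {
    float temp0 = 0.0f, temp1 = 0.0f;
    for (long i = 0; i < n; i += 2)
        for (long j = 0; j < n; ++j)
            for (long k = 0; k < n - 1; k++) {
                float tb = Tb[j*n + k];          // one load, two products
                temp0 += a[ i   *n + k] * tb;
                temp1 += a[(i+1)*n + k] * tb;
            }
    return temp0 + temp1;
}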
This is a reduction of a matmul, allowing much better optimization
(err wait, your code didn't show where temp was re-initialized, or what the c[] arg was for. I just assumed it was global or something, but probably you actually butchered a normal matmul function that stores a separate temp to c[] after each inner loop, by taking that store out. In that case, OoO exec between separate middle-loop iterations is relevant for medium-sized n. But you don't show the scheduler / ROB sizes, and that's not something you can easily model; you need to actually benchmark. So this section is probably only applicable to the question I invented, not what you meant to ask!)
Your loops appear to be summing the elements of a matmul result, but still structured like a matmul. i.e. do a row x column dot product, but instead of storing that into an N x N result[...] matrix, you're just summing the result.
That amounts to summing every pairwise product of elements between two matrices. Since we don't need to keep separate row x column dot products separate anymore, that enables a lot of optimizations! (There's nothing special or ideal about this order of summation; other orders will have different but likely not worse rounding error.)
For example, you don't need to transpose b into Tb, just use b (unless it was naturally already transposed, in which case that's fine). Your matrices are square so it doesn't matter at all.
Furthermore, you can simply load one or a couple elements from a[] and loop over Tb[], doing those products with FMA operations, one load per FMA, into 10 or 12 vector accumulators. (Or of course cache-block this to loop over a contiguous part of Tb that can stay hot in L1d cache.)
That could approach Haswell's max FLOPS throughput of 2x 256-bit FMA per clock = 8 (float elements per YMM vector) x 2 FMAs / clock x 2 FLOP/FMA = 32 FLOP / clock.
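As a rough, hedged illustration of that kind of inner kernel (broadcast one element of a[], FMA it against a contiguous run of the other matrix into several vector accumulators); the function name, the 4-accumulator choice, and the row-chunk framing are invented here, and a real kernel would use more accumulators plus cache blocking. Needs AVX2+FMA (e.g. -march=haswell).
#include <immintrin.h>
// Hedged sketch: acc[0..3] accumulate a_elem * brow[j] across a contiguous
// run of 'len' floats, one load per FMA, four independent FMA chains.
// Assumes len is a multiple of 32.
static void fma_row(float a_elem, const float *brow, long len, __m256 acc[4]) {
    __m256 va = _mm256_set1_ps(a_elem);          // one broadcast, reused
    for (long j = 0; j < len; j += 32) {         // 4 vectors x 8 floats per iter
        acc[0] = _mm256_fmadd_ps(va, _mm256_loadu_ps(brow + j     ), acc[0]);
        acc[1] = _mm256_fmadd_ps(va, _mm256_loadu_ps(brow + j +  8), acc[1]);
        acc[2] = _mm256_fmadd_ps(va, _mm256_loadu_ps(brow + j + 16), acc[2]);
        acc[3] = _mm256_fmadd_ps(va, _mm256_loadu_ps(brow + j + 24), acc[3]);
    }
}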

Related

Why in the C language loop, accessing array elements is so much slower than accessing variables? [duplicate]

I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What rules must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)
TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's AMD CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen some. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.)
Everything else is fully pipelined on mainstream Intel CPUs; multi-cycle (reciprocal) throughput for a single uop. (variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
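For example, a hedged SSE sketch of 4 cross products at once with that struct-of-arrays layout (the function name and loose-pointer interface are invented here):
#include <immintrin.h>
// Hedged sketch: c = a x b for 4 independent 3D vectors, stored as separate
// X/Y/Z arrays of 4 floats each. No shuffles needed.
static void cross4_soa(const float *ax, const float *ay, const float *az,
                       const float *bx, const float *by, const float *bz,
                       float *cx, float *cy, float *cz) {
    __m128 Ax = _mm_loadu_ps(ax), Ay = _mm_loadu_ps(ay), Az = _mm_loadu_ps(az);
    __m128 Bx = _mm_loadu_ps(bx), By = _mm_loadu_ps(by), Bz = _mm_loadu_ps(bz);
    _mm_storeu_ps(cx, _mm_sub_ps(_mm_mul_ps(Ay, Bz), _mm_mul_ps(Az, By)));
    _mm_storeu_ps(cy, _mm_sub_ps(_mm_mul_ps(Az, Bx), _mm_mul_ps(Ax, Bz)));
    _mm_storeu_ps(cz, _mm_sub_ps(_mm_mul_ps(Ax, By), _mm_mul_ps(Ay, Bx)));
}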
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015), which cover the array-of-structs vs. struct-of-arrays issue for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.

find nan in array of doubles using simd

This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN is the only value for which a != a is true). Something like:
double *data = ...;
__m256d a, b;
int temp = 0;
//This bit would be in a loop over the array
//I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the comparison examples it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well, if something like _CMP_EQ_UQ will detect NaNs, I could just use that to compare 4 doubles against 4 other doubles and magically look at 8 doubles at once.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?
You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. What does ordered / unordered comparison mean?
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x _mm256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
    __m256d a = _mm256_loadu_pd(p+0);
    __m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
    __m256d c = _mm256_loadu_pd(p+8);
    __m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
    __m256d abcdnan = _mm256_or_pd(abnan, cdnan);
    return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
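Putting that together with the overlapping-tail idea above, a hedged sketch of a whole-array scan might look like this (name invented; assumes n >= 16 and <stddef.h> for size_t):
// Hedged sketch: scan p[0..n-1] with any_nan_block(); the tail re-checks the
// last 16 elements, overlapping already-checked ones, which is harmless.
static int array_has_nan(double *p, size_t n)
{
    size_t i = 0;
    for (; i + 16 <= n; i += 16)
        if (any_nan_block(p + i))
            return 1;
    if (i < n)                              // tail: overlap the final full block
        return any_nan_block(p + n - 16) != 0;
    return 0;
}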
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fetestexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
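A minimal sketch of that check using C99 fenv.h (strictly, code that inspects the flags should also use #pragma STDC FENV_ACCESS ON):
#include <fenv.h>

// Before (re)generating the data:
feclearexcept(FE_INVALID);

// ... FP computation that fills the array ...

// Later: if the sticky "invalid" flag is still clear, no NaN was produced, so
// there's no need to scan. If it's set, fall back to scanning, since the
// exception may have come from a temporary that never reached the array.
if (fetestexcept(FE_INVALID)) {
    // scan the array, e.g. with any_nan_block() above
}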
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.
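A hedged sketch of that cache-blocking idea, with the block size and the "next pass" work as placeholders (array_has_nan is the scan loop sketched earlier; handle_nan_somehow and next_pass_over_block are hypothetical):
enum { BLOCK = 16 * 1024 };                     // 16K doubles = 128 KiB
for (size_t start = 0; start < n; start += BLOCK) {
    size_t len = (n - start < BLOCK) ? n - start : (size_t)BLOCK;
    if (array_has_nan(p + start, len))          // check the chunk while it's hot
        handle_nan_somehow();                   // hypothetical error path
    next_pass_over_block(p + start, len);       // hypothetical work that reads this data anyway
}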

Is there a loop construct that repeats n times without calculating some conditional?

This question arose out of the context of optimizing code to remove potential branch prediction failures... in fact, removing branches altogether.
For my example, a typical for-loop uses the following syntax:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>        /* for bool (missing from the original snippet) */
#define N 1000              /* example value; N wasn't defined in the question */

int main()
{
    bool *S = calloc(N + 1, sizeof(bool));
    int p = 2;
    for (int i = p * p; i <= N; i += p)
        S[i] = 1;
    return 0;
}
As I understand it, the assembly code produced would contain some sort of JMP instruction to check whether i <= N.
The only way I can conceive doing this in assembly would be to just repeat the same assembly instructions n times with i being incremented in each repetition. However, for large n, the binary produced would be huge.
So, I'm wondering, is there some loop construct that repeats n times without calculating some conditional?
Fully unrolling is the only practical way, and you're right that it's not appropriate for huge iteration counts (and usually not for loops where the iteration count isn't a compile-time constant). It's great for small loops with a known iteration count.
For your specific case, of setting memory to 1 over a given distance, x86 has rep stosb, which implements memset in microcode. It doesn't suffer from branch misprediction on current Intel CPUs, because microcode branches can't be predicted / speculatively executed and stall instead (or something), which means it has about 15 cycles of startup overhead before doing any stores. See What setup does REP do? For large aligned buffers, rep stosb/d/q is pretty close to an optimized AVX loop.
So you don't get a mispredict, but you do get a fixed startup overhead. (I think this stalls issue / execute of instructions after the rep stos, because the microcode sequencer takes over the front-end until it's done issuing all the uops. And it can't know when it's done until it's executed some that look at rcx. Issue happens in-order, so later independent instructions can't even get into the out-of-order part of the core until after rep stos figures out which uops to issue. The stores don't have to execute, but the microcode branches do.) Ice Lake is supposed to have "fast short REP MOV", which may finally solve the startup-overhead issue. Possibly by adding a dedicated hardware state-machine for rep movs and rep stos, like @KrazyGlew wishes he had when designing fast-strings in Intel's P6 uarch (PPro/PII/PIII), the ancestor of current Intel CPUs which still use very similar microcoded implementations. AFAIK, AMD is similar, but I haven't seen numbers or details for their rep stos startup overhead.
Most architectures don't have a single-instruction memset like that, so x86 is definitely a special case in computer architecture. But some old computers (like Atari ST) could have a "blitter" chip to offload copying memory, especially for graphics purposes. They'd use DMA to do the copying separately from the CPU altogether. This would be very similar to building a hardware state machine on-chip to run rep movs.
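For the contiguous memset-like case, a GNU C inline-asm sketch of rep stosb might look like this. In real code you would just call memset(dst, 1, len) and let the compiler/libc pick the strategy; also note the question's actual loop stores with a stride of p, which rep stosb can't do directly. The function name is invented.
// Hedged sketch: fill len bytes at dst with the value 1 using rep stosb.
// The ABI guarantees DF=0, so RDI counts upward.
static void fill_ones(unsigned char *dst, size_t len)
{
    asm volatile ("rep stosb"
                  : "+D" (dst), "+c" (len)   // rep stosb consumes RDI (dest) and RCX (count)
                  : "a" (1)                  // AL = byte value to store
                  : "memory");
}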
Misprediction of normal loop branches
Consider a normal asm loop, like
.looptop:                        # do {
    # loop body includes a pointer increment
    # may be unrolled some to do multiple stores per conditional branch
    cmp rdi, rdx
    jb .looptop                  # } while(dst < end_dst);
The branch ends up strongly predicted-taken after the loop has run a few times.
For large iteration counts, one branch mispredict is amortized over all the loop iterations and is typically negligible. (The conditional branch at the bottom of the loop is predicted taken, so the loop branch mispredicts the one time it's not taken.)
Some branch predictors have special support for loop branches, with a pattern predictor that can count patterns like taken 30 times / not-taken once, up to some limit on the max iteration count they can correctly predict.
Or a modern TAGE predictor (like in Intel Sandybridge-family) uses branch history to index the entry, so it "naturally" gives you some pattern prediction. For example, Skylake correctly predicts the loop-exit branch for inner loops up to about 22 iterations in my testing. (When there's a simple outer loop that re-runs the inner loop with the same iteration count repeatedly.) I tested in asm, not C, so I had control of how much unrolling was done.
A medium-length loop that's too long for the exit to be predicted correctly is the worst case for this. It's short enough that a mispredict on loop exit happens frequently, if it's an inner loop that repeatedly runs ~30 iterations on a CPU that can't predict that, for example.
The cost of one mispredict on the last iteration can be quite low. If out-of-order execution can "see" a few iterations ahead for a simple loop counter, it can have the branch itself executed before spending much time on real work beyond that mispredicted branch. With fast recovery for branch misses (not flushing the whole out-of-order pipeline by using something like a Branch Order Buffer), you still lose the chance to get started on independent work after the loop for a few cycles. This is most likely if the loop bottlenecks on latency of a dependency chain, but the counter itself is a separate chain. This paper about the real cost of branch misses is also interesting, and mentions this point.
(Note that I'm assuming that branch prediction is already "primed", so the first iteration correctly predicts the loop branch as taken.)
Impractical ways:
@Hadi linked an amusing idea: instead of running code normally, compile it in a weird way where control and instructions are all data, for example x86 using only mov instructions with x86 addressing modes + registers (and an unconditional branch at the bottom of a big block). Is it possible to make decisions in assembly without using `jump` and `goto` at all? This is hilariously inefficient, but doesn't actually have any conditional branches: everything is a data dependency.
It uses different forms of MOV (between registers, and immediate to register, as well as load/store), so it's not a one-instruction-set computer.
A less-insane version of this is an interpreter: different instructions / operations in the code being interpreted turn into control dependencies in the interpreter (creating a hard-to-solve efficiency problem for the interpreter; see Darek Mihocka's "The Common CPU Interpreter Loop Revisited" article), but data and control flow in the guest code are both data in the interpreter. The guest program counter is just another piece of data in the interpreter.

How can I test if CRC32C is a "good" random generator?

I recently discovered that the _mm_crc32_* intel intrinsic instruction could be used to generate (pseudo-)random 32 bit numbers.
#include <stdio.h>
#include <stdint.h>
#include <nmmintrin.h> /* needs CRC32C instruction from SSE4.2 instruction set extension */

int main(void)
{
    uint32_t rnd = 1; /* initialize with seed != 0 */
    /* period length is 4,294,967,295 = 2^32-1 */
    while (1) {
#if 0   // this was faster but worse than xorshift32 (fails more tests)
        rnd = _mm_crc32_u8(rnd, rnd >> 3);
#else   // this is faster and better than xorshift32 (fails fewer tests)
        rnd = _mm_crc32_u32(rnd, rnd << 18);
#endif
        printf("%08X\n", rnd);
    }
}
This method is as fast as an LCG, and faster than xorshift32. Wikipedia says that because xorshift generators "fail a few statistical tests, they have been accused of being unreliable".
Now I am wondering if the CRC32C method passes the various tests that are done on random number generators. I only verified that each bit, even the LSB, is "random" by trying to compress using PAQ8 compressors (which failed). Can someone help me do better tests?
EDIT: Using tests from the suggested TestU01 suite, the method that I previously used turned out to be worse than xorshift32. I have updated the source code above in case anyone is interested in using the better version.
This is an interesting question. Ultimately the only test that counts is "does this rng yield correct results for the problem I'm working on". What are you hoping to do with the rng?
In order to avoid answering that question for every different problem, various tests have been devised. See for example the "Diehard" tests devised by George Marsaglia. A web search for "marsaglia random number generator tests" turns up several interesting links.
I think Marsaglia's work is a couple of decades old at this point. I don't know if there has been more work on this topic since then. My guess is that for non-cryptographic purposes, an rng which passes the Diehard tests is probably sufficient.
There's a big difference in the requirements for a PRNG for a video game (especially single-player) vs. a monte-carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.
There's a reason that different PRNGs with different speed / quality tradeoffs exist.
This one is very fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.
On x86 with BMI2, each RNG step should only require rorx edx, eax, 3 / crc32 eax, dl. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency. (http://agner.org/optimize/). Or 3 uops without BMI2, for mov edx, eax / shr edx,3 / crc32 eax, dl, but still only 4 cycles of latency on CPUs with zero-latency mov for GP registers: Ivybridge+ and Ryzen.
2 uops is negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycles if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global to after a loop, costing you 2 extra 1-uop instructions).
On Ryzen, crc32 is 3 uops with 3c total latency, so more impact on surrounding code but the same one per 4 clock bottleneck if you're doing so little with the PRNG results that you bottleneck on that.
I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency. (Almost all relevant x86 CPUs are out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.
If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel (in different elements of XMM or YMM vectors). Especially if you can usefully use a __m256i vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.
Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be. i.e. 3 is always followed by 1897987234 (fake numbers), never 3 followed by something else. Most statistical quality tests should pick this up, but this might or might not be a problem for any given use-case.
Note that https://en.wikipedia.org/wiki/Xorshift is saying that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32c is also based on XOR and shift (but also with bit-reflect and modulo in Galois Field(2)), so it's reasonable to think it might be similar or better in quality.
You say your choice of crc32(rnd, rnd>>3) gives a period of 2^32, and that's the best you can do with a state that small. (Of course rnd++ achieves the same period, so it's not the only measure of quality.) It's likely at least as good as an LCG, but those are not considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).
I tested with a bitmap of previously seen values (NASM source), incrementing a counter until we reach a bitmap entry we've already seen. Indeed, all 2^32 values are seen before coming back to the original value. Since the period is exactly 0x100000000, that rules out any special values that are part of a shorter cycle.
Without memory access to the bitmap, Skylake does indeed run the loop at 4 cycles per iteration, as expected from the latency bottleneck. And rorx/crc32 is just 2 total uops.
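A portable C version of that period test might look like this (a sketch, not the linked NASM source; needs -msse4.2 and about 512 MiB for the bitmap):
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <nmmintrin.h>

int main(void) {
    uint8_t *seen = calloc(1u << 29, 1);   // 2^32 bits = 512 MiB bitmap
    if (!seen) return 1;
    uint32_t rnd = 1;                      // same seed as above
    uint64_t steps = 0;
    while (!(seen[rnd >> 3] & (1u << (rnd & 7)))) {
        seen[rnd >> 3] |= (uint8_t)(1u << (rnd & 7));
        rnd = _mm_crc32_u32(rnd, rnd << 18);
        steps++;
    }
    printf("cycle length: %llu (first repeated value: %08X)\n",
           (unsigned long long)steps, rnd);
    free(seen);
    return 0;
}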
One measure of the goodness of a PRNG is the length of the cycle. If that matters for your application, then a CRC-32 as you are using it would not be a good choice, since the cycle is only 2^32. One consequence is that if you are using more samples than that, which doesn't take very long, your results will repeat. Another is that there is a correlation between successive CRC-32 values, where there is only one possible value that will follow the current value.
Better PRNGs have exponentially longer cycles, and the values returned are smaller than the bits in the state, so that successive values do not have that correlation.
You don't need to use a CRC-32C instruction to be fast. Also you don't need to design your own PRNG, which is fraught with hidden perils. Best to leave that to the professionals. See this work for high-quality, small, and fast random number generators.

(n - Multiplication) vs (n/2 - multiplication + 2 additions) which is better?

I have a C program that does n multiplications (a single multiplication repeated over n iterations), and I found another approach that does n/2 iterations of (1 multiplication + 2 additions). I know both are O(n) in complexity, but in terms of CPU cycles, which is faster?
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication might be very cheap, on some newish Intel processors it's 3 clock cycles. Additions are 1 cycle on these same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
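The question doesn't show its code, so here is a hedged sketch of that fold vs. map distinction (the function names and shapes are invented for illustration):
// Loop-carried chain (a "fold"): each multiply waits for the previous one,
// so imul latency (~3 cycles) dominates; an add-based reduction could run at
// ~1 cycle per element instead.
long fold_mul(const long *x, long n) {
    long acc = 1;
    for (long i = 0; i < n; i++)
        acc *= x[i];
    return acc;
}

// Independent per-element work (a "map"): iterations don't depend on each
// other, so throughput/ports (and usually memory bandwidth) dominate, and one
// multiply per element is typically as fast as several additions.
void map_mul(long *x, long n, long k) {
    for (long i = 0; i < n; i++)
        x[i] *= k;
}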
First of all follow Dietrich Epp's first advice - measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: Latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN's and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
For Sandy Bridge the rec. throughput for an add r, r/i (notation from here on: r=register, i=immediate, m=memory) is 0.33 while the latency is 1.
An imul r, r has a latency of 3 and a rec. throughput of 1.
So as you see, it completely depends on your specific algorithm - if you can just replace one imul with two independent adds, this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand, if your adds add a problematic dependency, one imul could be just as fast as one add.
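As a hedged, concrete illustration of trading a multiply per iteration for additions (the question's actual code isn't shown, so these function names and the computation are invented):
// One imul per iteration; the multiplies are independent of each other,
// so this is throughput-bound rather than latency-bound.
long sum_mul(long n, long K) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += i * K;
    return sum;
}

// Strength-reduced: the multiply becomes a running addition (iK += K).
// Whether this wins depends on whether the extra loop-carried add chain and
// the extra instruction hurt more than the imul they replace.
long sum_add(long n, long K) {
    long sum = 0, iK = 0;
    for (long i = 0; i < n; i++) {
        sum += iK;
        iK += K;
    }
    return sum;
}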
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).

Resources