I want to implement unsigned[a] integer division by an arbitrary power of two, rounding up, efficiently. So what I want, mathematically, is ceiling(p/q)[0]. In C, the strawman implementation, which doesn't take advantage of the restricted domain of q, could be something like the following function[1]:
/** q must be a power of 2, although this version works for any q */
uint64_t divide(uint64_t p, uint64_t q) {
uint64_t res = p / q;
return p % q == 0 ? res : res + 1;
}
... of course, I don't actually want to use division or mod at the machine level, since that takes many cycles even on modern hardware. I'm looking for a strength reduction that uses shifts and/or some other cheap operation(s) - taking advantage of the fact that q is a power of 2.
You can assume we have an efficient lg(unsigned int x) function, which returns the base-2 log of x, if x is a power-of-two.
Undefined behavior is fine if q is zero.
Please note that the simple solution: (p+q-1) >> lg(q) doesn't work in general - try it with p == 2^64-100 and q == 256, for example[2].
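For concreteness, here is a minimal, self-contained demonstration of that failure (the variable names are mine): p + q - 1 wraps around to a tiny value, so the shifted result collapses to 0 instead of 2^56.
#include <stdint.h>
#include <inttypes.h>
#include <stdio.h>

int main(void) {
    uint64_t p = UINT64_MAX - 99;                        // 2^64 - 100
    uint64_t q = 256;                                    // a power of two
    uint64_t naive   = (p + q - 1) >> 8;                 // p + 255 overflows, giving 0
    uint64_t correct = (p >> 8) + ((p & (q - 1)) != 0);  // what we actually want: 2^56
    printf("naive=%" PRIu64 " correct=%" PRIu64 "\n", naive, correct);
    return 0;
}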
Platform Details
I'm interested in solutions in C that are likely to perform well across a variety of platforms, but for the sake of concreteness, for awarding the bounty, and because any definitive discussion of performance needs to include a target architecture, I'll be specific about how I'll test them:
Skylake CPU
gcc 5.4.0 with compile flags -O3 -march=haswell
Using gcc builtins (such as bitscan/leading zero builtins) is fine, and in particular I've implemented the lg() function I said was available as follows:
inline uint64_t lg(uint64_t x) {
return 63U - (uint64_t)__builtin_clzl(x);
}
inline uint32_t lg32(uint32_t x) {
return 31U - (uint32_t)__builtin_clz(x);
}
I verified that these compile down to a single bsr instruction, at least with -march=haswell, despite the apparent involvement of a subtraction. You are of course free to ignore these and use whatever other builtins you want in your solution.
Benchmark
I wrote a benchmark for the existing answers, and will share and update the results as changes are made.
Writing a good benchmark for a small, potentially inlined operation is quite tough. When code is inlined into a call site, a lot of the work of the function may disappear, especially when it's in a loop[3].
You could simply avoid the whole inlining problem by ensuring your code isn't inlined: declare it in another compilation unit. I tried that with the bench binary, but really the results are fairly pointless. Nearly all implementations tied at 4 or 5 cycles per call, but even a dummy method that does nothing other than return 0 takes the same time. So you are mostly just measuring the call + ret overhead. Furthermore, you are almost never really going to use the functions like this - unless you messed up, they'll be available for inlining and that changes everything.
So the two benchmarks I'll focus the most on repeatedly call the method under test in a loop, allowing inlining, cross-function optimization, loop hoisting and even vectorization.
There are two overall benchmark types: latency and throughput. The key difference is that in the latency benchmark, each call to divide is dependent on the previous call, so in general calls cannot be easily overlapped[4]:
uint32_t bench_divide_latency(uint32_t p, uint32_t q) {
  uint32_t total = p;
  for (unsigned i=0; i < ITERS; i++) {
    total += divide_algo(total, q);
    q = rotl1(q);
  }
  return total;
}
Note that the running total depends on the output of each divide call, and that it is also an input to the next divide call.
The throughput variant, on the other hand, doesn't feed the output of one divide into the subsequent one. This allows work from one call to be overlapped with a subsequent one (both by the compiler, but especially the CPU), and even allows vectorization:
uint32_t bench_divide_throughput(uint32_t p, uint32_t q) {
  uint32_t total = p;
  for (unsigned i=0; i < ITERS; i++) {
    total += divide_algo(i, q);
    q = rotl1(q);
  }
  return total;
}
Note that here we feed in the loop counter as the dividend - this is variable, but it doesn't depend on the previous divide call.
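rotl1 isn't shown above; it is just a rotate-left-by-1 so that q cycles through every power of two. A plausible definition (my assumption, not the exact benchmark code):
static inline uint32_t rotl1(uint32_t x) {
    return (x << 1) | (x >> 31);   // 32-bit rotate left by one bit
}
The 64-bit benchmarks would use the analogous 64-bit rotate.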
Furthermore, each benchmark has three flavors of behavior for the divisor, q:
Compile-time constant divisor. For example, a call to divide(p, 8). This is common in practice, and the code can be much simpler when the divisor is known at compile time.
Invariant divisor. Here the divisor is not known at compile time, but is constant for the whole benchmarking loop. This allows a subset of the optimizations that the compile-time constant case does.
Variable divisor. The divisor changes on each iteration of the loop. The benchmark functions above show this variant, using a "rotate left 1" instruction to vary the divisor.
Combining everything you get a total of 6 distinct benchmarks.
Results
Overall
For the purposes of picking an overall best algorithm, I looked at each of 12 subsets for the proposed algorithms: (latency, throughput) x (constant q, invariant q, variable q) x (32-bit, 64-bit) and assigned a score of 2, 1, or 0 per subtest as follows:
The best algorithm(s) (within 5% tolerance) receive a score of 2.
The "close enough" algorithms (no more than 50% slower than the best) receive a score of 1.
The remaining algorithms score zero.
Hence, the maximum total score is 24, but no algorithm achieved that. Here are the overall total results:
╔═══════════════════════╦═══════╗
║ Algorithm ║ Score ║
╠═══════════════════════╬═══════╣
║ divide_user23_variant ║ 20 ║
║ divide_chux ║ 20 ║
║ divide_user23 ║ 15 ║
║ divide_peter ║ 14 ║
║ divide_chrisdodd ║ 12 ║
║ stoke32 ║ 11 ║
║ divide_chris ║ 0 ║
║ divide_weather ║ 0 ║
╚═══════════════════════╩═══════╝
So for the purposes of this specific test code, with this specific compiler and on this platform, user2357112's "variant" (with ... + (p & mask) != 0) performs best, tied with chux's suggestion (which is in fact identical code).
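For reference, my restatement of the formulation those two entries share (q must be a power of two, and lg() is the function defined above):
uint64_t ceil_div_best(uint64_t p, uint64_t q) {
    return (p >> lg(q)) + ((p & (q - 1)) != 0);   // shift, plus 1 if any low bits were set
}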
Here are all the sub-scores which sum to the above:
╔══════════════════════════╦═══════╦════╦════╦════╦════╦════╦════╗
║ ║ Total ║ LC ║ LI ║ LV ║ TC ║ TI ║ TV ║
╠══════════════════════════╬═══════╬════╬════╬════╬════╬════╬════╣
║ divide_peter ║ 6 ║ 1 ║ 1 ║ 1 ║ 1 ║ 1 ║ 1 ║
║ stoke32 ║ 6 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 2 ║
║ divide_chux ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_user23 ║ 8 ║ 1 ║ 1 ║ 2 ║ 2 ║ 1 ║ 1 ║
║ divide_user23_variant ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_chrisdodd ║ 6 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 2 ║
║ divide_chris ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
║ divide_weather ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
║ ║ ║ ║ ║ ║ ║ ║ ║
║ 64-bit Algorithm ║ ║ ║ ║ ║ ║ ║ ║
║ divide_peter_64 ║ 8 ║ 1 ║ 1 ║ 1 ║ 2 ║ 2 ║ 1 ║
║ div_stoke_64 ║ 5 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 1 ║
║ divide_chux_64 ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_user23_64 ║ 7 ║ 1 ║ 1 ║ 2 ║ 1 ║ 1 ║ 1 ║
║ divide_user23_variant_64 ║ 10 ║ 2 ║ 2 ║ 2 ║ 1 ║ 2 ║ 1 ║
║ divide_chrisdodd_64 ║ 6 ║ 1 ║ 1 ║ 2 ║ 0 ║ 0 ║ 2 ║
║ divide_chris_64 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
║ divide_weather_64 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║ 0 ║
╚══════════════════════════╩═══════╩════╩════╩════╩════╩════╩════╝
Here, each test is named like XY, with X in {Latency, Throughput} and Y in {Constant Q, Invariant Q, Variable Q}. So for example, LC is "Latency test with constant q".
Analysis
At the highest level, the solutions can be roughly divided into two categories: fast (the top 6 finishers) and slow (the bottom two). The difference is large: all of the fast algorithms were the fastest on at least two subtests and in general when they didn't finish first they fell into the "close enough" category (the only exceptions being failed vectorizations in the case of stoke and chrisdodd). The slow algorithms however scored 0 (not even close) on every test. So you can mostly eliminate the slow algorithms from further consideration.
Auto-vectorization
Among the fast algorithms, a large differentiator was the ability to auto-vectorize.
None of the algorithms were able to auto-vectorize in the latency tests, which makes sense since the latency tests are designed to feed their result directly into the next iteration. So you can really only calculate results in a serial fashion.
For the throughput tests, however, many algorithms were able to auto-vectorize for the constant Q and invariant Q cases. In both of these tests the divisor q is loop-invariant (and in the former case it is a compile-time constant). The dividend is the loop counter, so it is variable, but predictable (and in particular a vector of dividends can be trivially calculated by adding 8 to the previous input vector: [0, 1, 2, ..., 7] + [8, 8, ..., 8] == [8, 9, 10, ..., 15]).
In this scenario, gcc was able to vectorize peter, stoke, chux, user23 and user23_variant. It wasn't able to vectorize chrisdodd for some reason, likely because it included a branch (but conditionals don't strictly prevent vectorization since many other solutions have conditional elements but still vectorized). The impact was huge: algorithms that vectorized showed about an 8x improvement in throughput over variants that didn't but were otherwise fast.
Vectorization isn't free, though! Here are the function sizes for the "constant" variant of each function, with the Vec? column showing whether a function vectorized or not:
Size Vec? Name
045 N bench_c_div_stoke_64
049 N bench_c_divide_chrisdodd_64
059 N bench_c_stoke32_64
212 Y bench_c_divide_chux_64
227 Y bench_c_divide_peter_64
220 Y bench_c_divide_user23_64
212 Y bench_c_divide_user23_variant_64
The trend is clear - vectorized functions take about 4x the size of the non-vectorized ones. This is both because the core loops themselves are larger (vector instructions tend to be larger and there are more of them), and because loop setup and especially the post-loop code is much larger: for example, the vectorized version requires a reduction to sum all the partial sums in a vector. The loop count is fixed and a multiple of 8, so no tail code is generated - but if it were variable the generated code would be even larger.
Furthermore, despite the large improvement in runtime, gcc's vectorization is actually poor. Here's an excerpt from the vectorized version of Peter's routine:
on entry: ymm4 == all zeros
on entry: ymm5 == 0x00000001 0x00000001 0x00000001 ...
4007a4: c5 ed 76 c4 vpcmpeqd ymm0,ymm2,ymm4
4007ad: c5 fd df c5 vpandn ymm0,ymm0,ymm5
4007b1: c5 dd fa c0 vpsubd ymm0,ymm4,ymm0
4007b5: c5 f5 db c0 vpand ymm0,ymm1,ymm0
This chunk works independently on 8 DWORD elements originating in ymm2. If we take x to be a single DWORD element of ymm2, and y1 the incoming value of ymm1, these four instructions correspond to:
x == 0 x != 0
x = x ? 0 : -1; // -1 0
x = x & 1; // 1 0
x = 0 - x; // -1 0
x = y1 & x; // y1 0
So the first three instructions could simply be replaced by the first one, since the states are identical in either case. So that's two cycles added to that dependency chain (which isn't loop carried) and two extra uops. Evidently gcc's optimization phases somehow interact poorly with the vectorization code here, since such trivial optimizations are rarely missed in scalar code. Examining the other vectorized versions similarly shows a lot of performance dropped on the floor.
Branches vs Branch-free
Nearly all of the solutions compiled to branch-free code, even if the C code had conditionals or explicit branches. The conditional portions were small enough that the compiler generally decided to use a conditional move or some variant. One exception is chrisdodd, which compiled with a branch (checking if p == 0) in all the throughput tests, but none of the latency ones. Here's a typical example from the constant q throughput test:
0000000000400e60 <bench_c_divide_chrisdodd_32>:
400e60: 89 f8 mov eax,edi
400e62: ba 01 00 00 00 mov edx,0x1
400e67: eb 0a jmp 400e73 <bench_c_divide_chrisdodd_32+0x13>
400e69: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0]
400e70: 83 c2 01 add edx,0x1
400e73: 83 fa 01 cmp edx,0x1
400e76: 74 f8 je 400e70 <bench_c_divide_chrisdodd_32+0x10>
400e78: 8d 4a fe lea ecx,[rdx-0x2]
400e7b: c1 e9 03 shr ecx,0x3
400e7e: 8d 44 08 01 lea eax,[rax+rcx*1+0x1]
400e82: 81 fa 00 ca 9a 3b cmp edx,0x3b9aca00
400e88: 75 e6 jne 400e70 <bench_c_divide_chrisdodd_32+0x10>
400e8a: c3 ret
400e8b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
The branch at 400e76 skips the case that p == 0. In fact, the compiler could have just peeled the first iteration out (calculating its result explicitly) and then avoided the jump entirely since after that it can prove that p != 0. In these tests, the branch is perfectly predictable, which could give an advantage to code that actually compiles using a branch (since the compare & branch code is essentially out of line and close to free), and is a big part of why chrisdodd wins the throughput, variable q case.
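Here's a sketch (mine, not the benchmark's actual code) of that peeling at the C level: do the first iteration outside the loop, so that inside the loop the dividend is never zero and the p == 0 check can disappear.
uint32_t bench_divide_throughput_peeled(uint32_t p, uint32_t q) {
    uint32_t total = p + divide_chrisdodd(0, q);   // peeled first iteration: result is 0
    q = rotl1(q);
    for (unsigned i = 1; i < ITERS; i++) {
        total += divide_chrisdodd(i, q);           // dividend i is never 0 here
        q = rotl1(q);
    }
    return total;
}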
Detailed Test Results
Here you can find some detailed test results and some details on the tests themselves.
Latency
The results below test each algorithm over 1e9 iterations. Cycles are calculated simply by multiplying the time/call by the clock frequency. You can generally assume that something like 4.01 is the same as 4.00, but the larger deviations like 5.11 seem to be real and reproducible.
The results for divide_plusq_32 use (p + q - 1) >> lg(q) but are only shown for reference, since this function fails for large p + q. The results for dummy are for a very simple function: return p + q, and let you estimate the benchmark overhead[5] (the addition itself should take a cycle at most).
==============================
Bench: Compile-time constant Q
==============================
Function ns/call cycles
divide_peter_32 2.19 5.67
divide_peter_64 2.18 5.64
stoke32_32 1.93 5.00
stoke32_64 1.97 5.09
stoke_mul_32 2.75 7.13
stoke_mul_64 2.34 6.06
div_stoke_32 1.94 5.03
div_stoke_64 1.94 5.03
divide_chux_32 1.55 4.01
divide_chux_64 1.55 4.01
divide_user23_32 1.97 5.11
divide_user23_64 1.93 5.00
divide_user23_variant_32 1.55 4.01
divide_user23_variant_64 1.55 4.01
divide_chrisdodd_32 1.95 5.04
divide_chrisdodd_64 1.93 5.00
divide_chris_32 4.63 11.99
divide_chris_64 4.52 11.72
divide_weather_32 2.72 7.04
divide_weather_64 2.78 7.20
divide_plusq_32 1.16 3.00
divide_plusq_64 1.16 3.00
divide_dummy_32 1.16 3.00
divide_dummy_64 1.16 3.00
==============================
Bench: Invariant Q
==============================
Function ns/call cycles
divide_peter_32 2.19 5.67
divide_peter_64 2.18 5.65
stoke32_32 1.93 5.00
stoke32_64 1.93 5.00
stoke_mul_32 2.73 7.08
stoke_mul_64 2.34 6.06
div_stoke_32 1.93 5.00
div_stoke_64 1.93 5.00
divide_chux_32 1.55 4.02
divide_chux_64 1.55 4.02
divide_user23_32 1.95 5.05
divide_user23_64 2.00 5.17
divide_user23_variant_32 1.55 4.02
divide_user23_variant_64 1.55 4.02
divide_chrisdodd_32 1.95 5.04
divide_chrisdodd_64 1.93 4.99
divide_chris_32 4.60 11.91
divide_chris_64 4.58 11.85
divide_weather_32 12.54 32.49
divide_weather_64 17.51 45.35
divide_plusq_32 1.16 3.00
divide_plusq_64 1.16 3.00
divide_dummy_32 0.39 1.00
divide_dummy_64 0.39 1.00
==============================
Bench: Variable Q
==============================
Function ns/call cycles
divide_peter_32 2.31 5.98
divide_peter_64 2.26 5.86
stoke32_32 2.06 5.33
stoke32_64 1.99 5.16
stoke_mul_32 2.73 7.06
stoke_mul_64 2.32 6.00
div_stoke_32 2.00 5.19
div_stoke_64 2.00 5.19
divide_chux_32 2.04 5.28
divide_chux_64 2.05 5.30
divide_user23_32 2.05 5.30
divide_user23_64 2.06 5.33
divide_user23_variant_32 2.04 5.29
divide_user23_variant_64 2.05 5.30
divide_chrisdodd_32 2.04 5.30
divide_chrisdodd_64 2.05 5.31
divide_chris_32 4.65 12.04
divide_chris_64 4.64 12.01
divide_weather_32 12.46 32.28
divide_weather_64 19.46 50.40
divide_plusq_32 1.93 5.00
divide_plusq_64 1.99 5.16
divide_dummy_32 0.40 1.05
divide_dummy_64 0.40 1.04
Throughput
Here are the results for the throughput tests. Note that many of the algorithms here were auto-vectorized, so the performance is relatively very good for those: a fraction of a cycle in many cases. One result is that unlike most latency results, the 64-bit functions are considerably slower, since vectorization is more effective with smaller element sizes (although the gap is larger than I would have expected).
==============================
Bench: Compile-time constant Q
==============================
Function ns/call cycles
stoke32_32 0.39 1.00
divide_chux_32 0.15 0.39
divide_chux_64 0.53 1.37
divide_user23_32 0.14 0.36
divide_user23_64 0.53 1.37
divide_user23_variant_32 0.15 0.39
divide_user23_variant_64 0.53 1.37
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.34 11.23
divide_chris_64 4.34 11.24
divide_weather_32 1.35 3.50
divide_weather_64 1.35 3.50
divide_plusq_32 0.10 0.26
divide_plusq_64 0.39 1.00
divide_dummy_32 0.08 0.20
divide_dummy_64 0.39 1.00
==============================
Bench: Invariant Q
==============================
Function ns/call cycles
stoke32_32 0.48 1.25
divide_chux_32 0.15 0.39
divide_chux_64 0.48 1.25
divide_user23_32 0.17 0.43
divide_user23_64 0.58 1.50
divide_user23_variant_32 0.15 0.38
divide_user23_variant_64 0.48 1.25
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.35 11.26
divide_chris_64 4.36 11.28
divide_weather_32 5.79 14.99
divide_weather_64 17.00 44.02
divide_plusq_32 0.12 0.31
divide_plusq_64 0.48 1.25
divide_dummy_32 0.09 0.23
divide_dummy_64 0.09 0.23
==============================
Bench: Variable Q
==============================
Function ns/call cycles
stoke32_32 1.16 3.00
divide_chux_32 1.36 3.51
divide_chux_64 1.35 3.50
divide_user23_32 1.54 4.00
divide_user23_64 1.54 4.00
divide_user23_variant_32 1.36 3.51
divide_user23_variant_64 1.55 4.01
divide_chrisdodd_32 1.16 3.00
divide_chrisdodd_64 1.16 3.00
divide_chris_32 4.02 10.41
divide_chris_64 3.84 9.95
divide_weather_32 5.40 13.98
divide_weather_64 19.04 49.30
divide_plusq_32 1.03 2.66
divide_plusq_64 1.03 2.68
divide_dummy_32 0.63 1.63
divide_dummy_64 0.66 1.71
[a] At least by specifying unsigned we avoid the whole can of worms related to the right-shift behavior of signed integers in C and C++.
[0] Of course, this notation doesn't actually work in C where / truncates the result so the ceiling does nothing. So consider that pseudo-notation rather than straight C.
[1] I'm also interested in solutions where all types are uint32_t rather than uint64_t.
[2] In general, any p and q where p + q >= 2^64 causes an issue, due to overflow.
[3] That said, the function should be in a loop, because the performance of a microscopic function that takes half a dozen cycles only really matters if it is called in a fairly tight loop.
[4] This is a bit of a simplification - only the dividend p is dependent on the output of the previous iteration, so some work related to the processing of q can still be overlapped.
[5] Use such estimates with caution however - overhead isn't simply additive. If the overhead shows up as 4 cycles and some function f takes 5, it's likely not accurate to say the cost of the real work in f is 5 - 4 == 1, because of the way execution is overlapped.
This answer is about what's ideal in asm; what we'd like to convince the compiler to emit for us. (I'm not suggesting actually using inline asm, except as a point of comparison when benchmarking compiler output. https://gcc.gnu.org/wiki/DontUseInlineAsm).
I did manage to get pretty good asm output from pure C for ceil_div_andmask, see below. (It's worse than a CMOV on Broadwell/Skylake, but probably good on Haswell. Still, the user23/chux version looks even better for both cases.) It's mostly just worth mentioning as one of the few cases where I got the compiler to emit the asm I wanted.
It looks like Chris Dodd's general idea of return ((p-1) >> lg(q)) + 1 with special-case handling for d=0 is one of the best options. I.e. the optimal implementation of it in asm is hard to beat with an optimal implementation of anything else. Chux's (p >> lg(q)) + (bool)(p & (q-1)) also has advantages (like lower latency from p->result), and more CSE when the same q is used for multiple divisions. See below for a latency/throughput analysis of what gcc does with it.
If the same e = lg(q) is reused for multiple dividends, or the same dividend is reused for multiple divisors, different implementations can CSE more of the expression. They can also effectively vectorize with AVX2.
Branches are cheap and very efficient if they predict very well, so branching on d==0 will be best if it's almost never taken. If d==0 is not rare, branchless asm will perform better on average. Ideally we can write something in C that will let gcc make the right choice during profile-guided optimization, and compiles to good asm for either case.
Since the best branchless asm implementations don't add much latency vs. a branchy implementation, branchless is probably the way to go unless the branch would go the same way maybe 99% of the time. This might be likely for branching on p==0, but probably less likely for branching on p & (q-1).
It's hard to guide gcc5.4 into emitting anything optimal. This is my work-in-progress on Godbolt.
I think the optimal sequences for Skylake for this algorithm are as follows. (Shown as stand-alone functions for the AMD64 SysV ABI, but talking about throughput/latency on the assumption that the compiler will emit something similar inlined into a loop, with no RET attached).
Branch on carry from calculating d-1 to detect d==0, instead of a separate test & branch. Reduces the uop count nicely, esp on SnB-family where JC can macro-fuse with SUB.
ceil_div_pjc_branch:
xor eax,eax ; can take this uop off the fast path by adding a separate xor-and-return block, but in reality we want to inline something like this.
sub rdi, 1
jc .d_was_zero ; fuses with the sub on SnB-family
tzcnt rax, rsi ; tzcnt rsi,rsi also avoids any false-dep problems, but this illustrates that the q input can be read-only.
shrx rax, rdi, rax
inc rax
.d_was_zero:
ret
Fused-domain uops: 5 (not counting ret), and one of them is an xor-zero (no execution unit)
HSW/SKL latency with successful branch prediction:
(d==0): No data dependency on d or q, breaks the dep chain. (control dependency on d to detect mispredicts and retire the branch).
(d!=0): q->result: tzcnt+shrx+inc = 5c
(d!=0): d->result: sub+shrx+inc = 3c
Throughput: probably just bottlenecked on uop throughput
I've tried but failed to get gcc to branch on CF from the subtract, but it always wants to do a separate comparison. I know gcc can be coaxed into branching on CF after subtracting two variables, but maybe this fails if one is a compile-time constant. (IIRC, this typically compiles to a CF test with unsigned vars: foo -= bar; if(foo>bar) carry_detected = 1;)
Branchless with ADC / SBB to handle the d==0 case. Zero-handling adds only one instruction to the critical path (vs. a version with no special handling for d==0), but also converts one other from an INC to a sbb rax, -1 to make CF undo the -= -1. Using a CMOV is cheaper on pre-Broadwell, but takes extra instructions to set it up.
ceil_div_pjc_asm_adc:
tzcnt rsi, rsi
sub rdi, 1
adc rdi, 0 ; d? d-1 : d. Sets CF=CF
shrx rax, rdi, rsi
sbb rax, -1 ; result++ if d was non-zero
ret
Fused-domain uops: 5 (not counting ret) on SKL. 7 on HSW
SKL latency:
q->result: tzcnt+shrx+sbb = 5c
d->result: sub+adc+shrx(dep on q begins here)+sbb = 4c
Throughput: TZCNT runs on p1. SBB, ADC, and SHRX only run on p06. So I think we bottleneck on 3 uops for p06 per iteration, making this run at best one iteration per 1.5c.
If q and d become ready at the same time, note that this version can run SUB/ADC in parallel with the 3c latency of TZCNT. If both are coming from the same cache-miss cache line, it's certainly possible. In any case, introducing the dep on q as late as possible in the d->result dependency chain is an advantage.
Getting this from C seems unlikely with gcc5.4. There is an intrinsic for add-with-carry, but gcc makes a total mess of it. It doesn't use immediate operands for ADC or SBB, and stores the carry into an integer reg between every operation. gcc7, clang3.9, and icc17 all make terrible code from this.
#include <x86intrin.h>
#include <stdint.h>
typedef uint64_t T;   // assumed: T stands for the 64-bit unsigned operand type used throughout this answer
// compiles to completely horrible code, putting the flags into integer regs between ops.
T ceil_div_adc(T d, T q) {
T e = lg(q);
unsigned long long dm1; // unsigned __int64
unsigned char CF = _addcarry_u64(0, d, -1, &dm1);
CF = _addcarry_u64(CF, 0, dm1, &dm1);
T shifted = dm1 >> e;
_subborrow_u64(CF, shifted, -1, &dm1);
return dm1;
}
# gcc5.4 -O3 -march=haswell
mov rax, -1
tzcnt rsi, rsi
add rdi, rax
setc cl
xor edx, edx
add cl, -1
adc rdi, rdx
setc dl
shrx rdi, rdi, rsi
add dl, -1
sbb rax, rdi
ret
CMOV to fix the whole result: worse latency from q->result, since it's used sooner in the d->result dep chain.
ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
sub rdi, 1
shrx rax, rdi, rsi
lea rax, [rax+1] ; inc preserving flags
cmovc rax, zeroed_register
ret
Fused-domain uops: 5 (not counting ret) on SKL. 6 on HSW
SKL latency:
q->result: tzcnt+shrx+lea+cmov = 6c (worse than ADC/SBB by 1c)
d->result: sub+shrx(dep on q begins here)+lea+cmov = 4c
Throughput: TZCNT runs on p1. LEA is p15. CMOV and SHRX are p06. SUB is p0156. In theory only bottlenecked on fused-domain uop throughput, so one iteration per 1.25c. With lots of independent operations, resource conflicts from SUB or LEA stealing p1 or p06 shouldn't be a throughput problem because at 1 iter per 1.25c, no port is saturated with uops that can only run on that port.
CMOV to get an operand for SUB: I was hoping I could find a way to create an operand for a later instruction that would produce a zero when needed, without an input dependency on q, e, or the SHRX result. This would help if d is ready before q, or at the same time.
This doesn't achieve that goal, and needs an extra 7-byte mov rdx,-1 in the loop.
ceil_div_pjc_asm_cmov:
tzcnt rsi, rsi
mov rdx, -1
sub rdi, 1
shrx rax, rdi, rsi
cmovnc rdx, rax
sub rax, rdx ; res += d ? 1 : -res
ret
Lower-latency version for pre-BDW CPUs with expensive CMOV, using SETCC to create a mask for AND.
ceil_div_pjc_asm_setcc:
xor edx, edx ; needed every iteration
tzcnt rsi, rsi
sub rdi, 1
setc dl ; d!=0 ? 0 : 1
dec rdx ; d!=0 ? -1 : 0 // AND-mask
shrx rax, rdi, rsi
inc rax
and rax, rdx ; zero the bogus result if d was initially 0
ret
Still 4c latency from d->result (and 6 from q->result), because the SETC/DEC happen in parallel with the SHRX/INC. Total uop count: 8. Most of these insns can run on any port, so it should be 1 iter per 2 clocks.
Of course, for pre-HSW, you also need to replace SHRX.
We can get gcc5.4 to emit something nearly as good: (still uses a separate TEST instead of setting mask based on sub rdi, 1, but otherwise the same instructions as above). See it on Godbolt.
T ceil_div_andmask(T p, T q) {
T mask = -(T)(p!=0); // TEST+SETCC+NEG
T e = lg(q);
T nonzero_result = ((p-1) >> e) + 1;
return nonzero_result & mask;
}
When the compiler knows that p is non-zero, it takes advantage and makes nice code:
// http://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer
#if defined(__GNUC__) && !defined(__INTEL_COMPILER)
#define assume(x) do{if(!(x)) __builtin_unreachable();}while(0)
#else
#define assume(x) (void)(x) // still evaluate it once, for side effects in case anyone is insane enough to put any inside an assume()
#endif
T ceil_div_andmask_nonzerop(T p, T q) {
assume(p!=0);
return ceil_div_andmask(p, q);
}
# gcc5.4 -O3 -march=haswell
xor eax, eax # gcc7 does tzcnt in-place instead of wasting an insn on this
sub rdi, 1
tzcnt rax, rsi
shrx rax, rdi, rax
add rax, 1
ret
Chux / user23_variant
only 3c latency from p->result, and constant q can CSE a lot.
T divide_A_chux(T p, T q) {
bool round_up = p & (q-1); // compiles differently from user23_variant with clang: AND instead of
return (p >> lg(q)) + round_up;
}
xor eax, eax # in-place tzcnt would save this
xor edx, edx # target for setcc
tzcnt rax, rsi
sub rsi, 1
test rsi, rdi
shrx rdi, rdi, rax
setne dl
lea rax, [rdx+rdi]
ret
Doing the SETCC before TZCNT would allow an in-place TZCNT, saving the xor eax,eax. I haven't looked at how this inlines in a loop.
Fused-domain uops: 8 (not counting ret) on HSW/SKL
HSW/SKL latency:
q->result: (tzcnt+shrx(p) | sub+test(p)+setne) + lea(or add) = 5c
d->result: test(dep on q begins here)+setne+lea = 3c. (the shrx->lea chain is shorter, and thus not the critical path)
Throughput: Probably just bottlenecked on the frontend, at one iter per 2c. Saving the xor eax,eax should speed this up to one per 1.75c (but of course any loop overhead will be part of the bottleneck, because frontend bottlenecks are like that).
uint64_t exponent = lg(q);
uint64_t mask = q - 1;
// v divide
return (p >> exponent) + (((p & mask) + mask) >> exponent);
// ^ round up
The separate computation of the "round up" part avoids the overflow issues of (p+q-1) >> lg(q). Depending on how smart your compiler is, it might be possible to express the "round up" part as ((p & mask) != 0) without branching.
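Put together as a full function (my wording of the snippet above, using the lg() from the question), that looks like:
uint64_t ceil_div_pow2(uint64_t p, uint64_t q) {
    uint64_t exponent = lg(q);        // q is a power of two
    uint64_t mask = q - 1;
    // the round-up term is computed separately, so p + q - 1 is never formed
    return (p >> exponent) + (((p & mask) + mask) >> exponent);
}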
The efficient way of dividing by a power of 2 for an unsigned integer in C is a right shift -- shifting right by one divides by two (rounding down), so shifting right by n divides by 2^n (rounding down).
Now you want to round up rather than down, which you can do by first adding 2^n - 1, or equivalently subtracting one before the shift and adding one after (except for 0). This works out to something like:
unsigned ceil_div(unsigned d, unsigned e) {
/* compute ceil(d/2**e) */
return d ? ((d-1) >> e) + 1 : 0;
}
The conditional can be removed by using the boolean value of d for addition and subtraction of one:
unsigned ceil_div(unsigned d, unsigned e) {
/* compute ceil(d/2**e) */
return ((d - !!d) >> e) + !!d;
}
Due to its size, and the speed requirement, the function should be made static inline. It probably won't make a difference for the optimizer, but the parameters should be const. If it must be shared among many files, define it in a header:
static inline unsigned ceil_div(const unsigned d, const unsigned e){...
[Re-write] given OP's clarification concerning power-of-2.
The round-up or ceiling part is easy when overflow is not a concern: simply add q-1, then shift.
Otherwise, since the possibility of rounding up depends on the bits of p below the divisor's single set bit, those bits need to be detected before they are shifted out.
uint64_t divide_A(uint64_t p, uint64_t q) {
bool round_up = p & (q-1);
return (p >> lg64(q)) + round_up;
}
This assumes code has an efficient lg64(uint64_t x) function, which returns the base-2 log of x, if x is a power-of-two.
My old answer didn't work if p was one more than a power of two (whoops). So my new solution, using the __builtin_ctzll() and __builtin_ffsll() functions[0] available in gcc (which, as a bonus, provide the fast logarithm you mentioned!):
uint64_t divide(uint64_t p,uint64_t q) {
int lp=__builtin_ffsll(p);
int lq=__builtin_ctzll(q);
return (p>>lq)+(lp<(lq+1)&&lp);
}
Note that this is assuming that a long long is 64 bits. It has to be tweaked a little otherwise.
The idea here is that we need to round up if and only if p has fewer trailing zeroes than q. Note that for a power of two, the number of trailing zeroes is equal to the logarithm, so we can use this builtin for the log as well.
The &&lp part is just for the corner case where p is zero: otherwise it will output 1 here.
[0] Can't use __builtin_ctzll() for both because it is undefined if p==0.
If the dividend/divisor can be guaranteed not to exceed 63 (or 31) bits, you can use the following version, mentioned in the question. Note how p + q could overflow if they use all 64 bits. This would be fine if the SHR instruction shifted in the carry flag, but AFAIK it doesn't.
uint64_t divide(uint64_t p, uint64_t q) {
return (p + q - 1) >> lg(q);
}
If those constraints cannot be guaranteed, you can just do the floor method and then add 1 if it would round up. This can be determined by checking if any bits in the dividend are within the range of the divisor.
Note: p & -p extracts the lowest set bit on 2s complement machines (it is what the BLSI instruction computes).
uint64_t divide(uint64_t p, uint64_t q) {
return (p >> lg(q)) + ( (p & -p ) < q );
}
Which clang compiles to:
bsrq %rsi, %rax
shrxq %rax, %rdi, %rax
blsiq %rdi, %rcx
cmpq %rsi, %rcx
adcq $0, %rax
retq
That's a bit wordy and uses some newer instructions, so maybe there is a way to use the carry flag in the original version. Let's see:
The RCR instruction does, but it seems like it would be worse ... perhaps the SHRD instruction... It would be something like this (unable to test at the moment):
xor edx, edx ;edx = 0 (will store the carry flag)
bsr rcx, rsi ;rcx = lg(q) ... could be moved anywhere before shrd
lea rax, [rsi-1] ;rax = q-1 (adding p could carry)
add rax, rdi ;rax += p (handle carry)
setc dl ;rdx = carry flag ... or xor rdx and setc
shrd rax, rdx, cl ;rax = rdx:rax >> cl
ret
1 more instruction, but should be compatible with older processors (if it works ... I'm always getting a source/destination swapped - feel free to edit)
Addendum:
I've implemented the lg() function I said was available as follows:
inline uint64_t lg(uint64_t x) {
return 63U - (uint64_t)__builtin_clzl(x);
}
inline uint32_t lg32(uint32_t x) {
return 31U - (uint32_t)__builtin_clz(x);
}
The fast log functions don't fully optimize to bsr on clang and ICC but you can do this:
#if defined(__x86_64__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint64_t lg(uint64_t x){
uint64_t ret;
//other compilers may want bsrq here; operands are in AT&T (source, destination) order
__asm__("bsr %1, %0":"=r"(ret):"r"(x));
return ret;
}
#endif
#if defined(__i386__) && (defined(__clang__) || defined(__INTEL_COMPILER))
static inline uint32_t lg32(uint32_t x){
uint32_t ret;
__asm__("bsr %1, %0":"=r"(ret):"r"(x));
return ret;
}
#endif
There has already been a lot of human brainpower applied to this problem, with several variants of great answers in C along with Peter Cordes's answer which covers the best you could hope for in asm, with notes on trying to map it back to C.
So while the humans are having their kick at the can, I thought I'd see what some brute computing power has to say! To that end, I used Stanford's STOKE superoptimizer to try to find good solutions to the 32-bit and 64-bit versions of this problem.
A superoptimizer is usually something like a brute force search through all possible instruction sequences until you find the best one by some metric. Of course, with something like 1,000 instructions that will quickly spiral out of control for more than a few instructions[1]. STOKE, on the other hand, takes a guided randomized approach: it randomly makes mutations to an existing candidate program, evaluating at each step a cost function that takes both performance and correctness into account. That's the one-liner anyway - there are plenty of papers if that stoked your curiosity.
So within a few minutes STOKE found some pretty interesting solutions. It found almost all the high-level ideas in the existing solutions, plus a few unique ones. For example, for the 32-bit function, STOKE found this version:
neg rsi
dec rdi
pext rax, rdi, rsi
inc eax
ret
It doesn't use any leading/trailing-zero count or shift instructions at all. Pretty much, it uses neg rsi to turn the divisor into a mask with 1s in the high bits, and then pext effectively does the shift using that mask. Outside of that trick it's using the same trick that user QuestionC used: decrement p, shift, increment the result - but it happens to work even for a zero dividend because it uses 64-bit registers to distinguish the zero case from the MSB-set large p case.
I added the C version of this algorithm to the benchmark, and added it to the results. It's competitive with the other good algorithms, tying for first in the "Variable Q" cases. It does vectorize, but not as well as the other 32-bit algorithms, because it needs 64-bit math and so the vectors can process only half as many elements at once.
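The C version added to the benchmark isn't shown here; a plausible rendering (my sketch, assuming the pext source is p-1 and the mask is -q, as described above) is:
#include <stdint.h>
#include <immintrin.h>   // _pext_u64 (BMI2)

uint32_t stoke32_c(uint32_t p, uint32_t q) {
    uint64_t mask = (uint64_t)0 - q;      // -q: ones from bit lg(q) upward
    uint64_t pm1  = (uint64_t)p - 1;      // wraps to all-ones when p == 0
    // pext packs the bits of p-1 selected by the mask, i.e. (p-1) >> lg(q);
    // the +1 and the truncation to 32 bits make the p == 0 case come out as 0
    return (uint32_t)(_pext_u64(pm1, mask) + 1);
}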
Even better, in the 32-bit case it came up with a variety of solutions which use the fact that some of the intuitive solutions that fail for edge cases happen to "just work" if you use 64-bit ops for part of it. For example:
tzcntl ebp, esi
dec esi
add rdi, rsi
sarx rax, rdi, rbp
ret
That's the equivalent of the return (p + q - 1) >> lg(q) suggestion I mentioned in the question. That doesn't work in general since for large p + q it overflows, but for 32-bit p and q this solution works great by doing the important parts in 64-bit. Convert that back into C with some casts and gcc actually figures out that using lea will do the addition in one instruction[2]:
stoke_32(unsigned int, unsigned int):
tzcnt edx, esi
mov edi, edi ; goes away when inlining
mov esi, esi ; goes away when inlining
lea rax, [rsi-1+rdi]
shrx rax, rax, rdx
ret
So it's a 3-instruction solution when inlined into something that already has the values zero-extended into rdi and rsi. The stand-alone function definition needs the mov instructions to zero-extend because that's how the SysV x64 ABI works.
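Written out in C with the casts mentioned above, it might look something like this (a sketch):
uint32_t stoke_32_c(uint32_t p, uint32_t q) {
    // the sum is done in 64 bits, so p + q - 1 cannot overflow for 32-bit inputs
    return (uint32_t)(((uint64_t)p + q - 1) >> lg32(q));
}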
For the 64-bit function it didn't come up with anything that blows away the existing solutions but it did come up with some neat stuff, like:
tzcnt r13, rsi
tzcnt rcx, rdi
shrx rax, rdi, r13
cmp cl, r13b
adc rax, 0
ret
That guy counts the trailing zeros of both arguments, and then adds 1 to the result if p has fewer trailing zeros than q, since that's when you need to round up. Clever!
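In C the same idea might look like this (a sketch; the p != 0 guard stands in for tzcnt's well-defined result on zero, since __builtin_ctzll(0) is undefined):
uint64_t ceil_div_tzcmp(uint64_t p, uint64_t q) {
    // round up exactly when p has fewer trailing zeros than q (and p != 0)
    return (p >> lg(q)) + (p != 0 && __builtin_ctzll(p) < __builtin_ctzll(q));
}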
In general, it understood the idea that you needed to shift right by the tzcnt really quickly (just like most humans) and then came up with a ton of other solutions to the problem of adjusting the result to account for rounding. It even managed to use blsi and bzhi in several solutions. Here's a 5-instruction solution it came up with:
tzcnt r13, rsi
shrx rax, rdi, r13
imul rsi, rax
cmp rsi, rdi
adc rax, 0
ret
It's basically a "multiply and verify" approach - take the truncated res = p / q, multiply it back, and if it's different from p add one: return res * q == p ? res : res + 1. Cool. Not really better than Peter's solutions though. STOKE seems to have some flaws in its latency calculation - it thinks the above has a latency of 5 - but it's more like 8 or 9 depending on the architecture. So it sometimes narrows in on solutions that are based on its flawed latency calculation.
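In C, the multiply-and-verify idea is something like this (my sketch, again using the question's lg()):
uint64_t ceil_div_mulcheck(uint64_t p, uint64_t q) {
    uint64_t res = p >> lg(q);              // truncated quotient
    return res * q == p ? res : res + 1;    // multiply back; add 1 if it doesn't round-trip
}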
[1] Interestingly enough though, this brute force approach reaches the limit of its feasibility at around 5-6 instructions: if you assume you can trim the instruction count to say 300 by eliminating SIMD and x87 instructions, then you would need ~28 days to try all 300^5 5-instruction sequences at 1,000,000 candidates/second. You could perhaps reduce that by a factor of 1,000 with various optimizations, meaning less than an hour for 5-instruction sequences and maybe a week for 6-instruction ones. As it happens, most of the best solutions for this problem fall into that 5-6 instruction window...
[2] This will be a slow lea, however, so the sequence found by STOKE was still optimal for what I optimized for, which was latency.
You can do it like this, by comparing the division n / d with the division (n - 1) / d.
#include <stdio.h>
int main(void) {
  unsigned n;
  unsigned d;
  unsigned q1, q2;
  double actual;
  for(n = 1; n < 6; n++) {
    for(d = 1; d < 6; d++) {
      actual = (double)n / d;
      q1 = n / d;
      if(n) {
        q2 = (n - 1) / d;
        if(q1 == q2) {
          q1++;
        }
      }
      printf("%u / %u = %u (%f)\n", n, d, q1, actual);
    }
  }
  return 0;
}
Program output:
1 / 1 = 1 (1.000000)
1 / 2 = 1 (0.500000)
1 / 3 = 1 (0.333333)
1 / 4 = 1 (0.250000)
1 / 5 = 1 (0.200000)
2 / 1 = 2 (2.000000)
2 / 2 = 1 (1.000000)
2 / 3 = 1 (0.666667)
2 / 4 = 1 (0.500000)
2 / 5 = 1 (0.400000)
3 / 1 = 3 (3.000000)
3 / 2 = 2 (1.500000)
3 / 3 = 1 (1.000000)
3 / 4 = 1 (0.750000)
3 / 5 = 1 (0.600000)
4 / 1 = 4 (4.000000)
4 / 2 = 2 (2.000000)
4 / 3 = 2 (1.333333)
4 / 4 = 1 (1.000000)
4 / 5 = 1 (0.800000)
5 / 1 = 5 (5.000000)
5 / 2 = 3 (2.500000)
5 / 3 = 2 (1.666667)
5 / 4 = 2 (1.250000)
5 / 5 = 1 (1.000000)
Update
I posted an early answer to the original question, which works, but did not consider the efficiency of the algorithm, or that the divisor is always a power of 2. Performing two divisions was needlessly expensive.
I am using the MSVC 32-bit compiler on a 64-bit system, so there is no chance that I can provide the best solution for the required target. But it is an interesting question, so I have dabbled around and found that the best solution needs to discover the bit position of the 2**n divisor. Using the library function log2 worked but was slow. Doing my own shift was much better, but still my best C solution is:
unsigned roundup(unsigned p, unsigned q)
{
return p / q + ((p & (q-1)) != 0);
}
My inline 32-bit assembler solution is faster, but of course that will not answer the question. I steal some cycles by assuming that eax is returned as the function value.
unsigned roundup(unsigned p, unsigned q)
{
__asm {
mov eax,p
mov edx,q
bsr ecx,edx ; cl = bit number of q
dec edx ; q-1
and edx,eax ; p & (q-1)
shr eax,cl ; divide p by q, a power of 2
sub edx,1 ; generate a carry when (p & (q-1)) == 0
cmc
adc eax,0 ; add 1 to result when (p & (q-1)) != 0
}
} ; eax returned as function value
This seems efficient and works for signed if your compiler is using arithmetic right shifts (usually true).
#include <stdio.h>
int main (void)
{
for (int i = -20; i <= 20; ++i) {
printf ("%5d %5d\n", i, ((i - 1) >> 1) + 1);
}
return 0;
}
Use >> 2 to divide by 4, >> 3 to divide by 8, etc. An efficient lg does the work there.
You can even divide by 1! >> 0
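Generalizing that to any power-of-two shift count e gives something like this sketch (same caveat about arithmetic right shift, and d must not be INT_MIN, since d - 1 would overflow):
int ceil_div_signed(int d, int e) {
    /* ceil(d / 2**e), relying on arithmetic right shift of negative values */
    return ((d - 1) >> e) + 1;
}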
Related
So I decided to take a look at how to use SSE, AVX, ... in C via Intel® Intrinsics. Not because of any actual interest to use it for something, but out of pure curiosity. Trying to check if code using AVX is actually faster than non-AVX code, I was a bit surprised by the results. Here is my C code:
#include <stdio.h>
#include <stdlib.h>
#include <emmintrin.h>
#include <immintrin.h>
/*** Sum up two vectors using AVX ***/
#define __vec_sum_4d_d64(src_vec1, src_vec2, dst_vec) \
_mm256_store_pd(dst_vec, _mm256_add_pd(_mm256_load_pd(src_vec1), _mm256_load_pd(src_vec2)));
/*** Sum up two vectors without AVX ***/
#define __vec_sum_4d(src_vec1, src_vec2, dst_vec) \
dst_vec[0] = src_vec1[0] + src_vec2[0];\
dst_vec[1] = src_vec1[1] + src_vec2[1];\
dst_vec[2] = src_vec1[2] + src_vec2[2];\
dst_vec[3] = src_vec1[3] + src_vec2[3];
int main (int argc, char *argv[]) {
unsigned long i;
double dvec1[4] = {atof(argv[1]), atof(argv[2]), atof(argv[3]), atof(argv[4])};
double dvec2[4] = {atof(argv[5]), atof(argv[6]), atof(argv[7]), atof(argv[8])};
#if 1
for (i = 0; i < 3000000000; i++) {
__vec_sum_4d(dvec1, dvec2, dvec2);
}
#endif
#if 0
for (i = 0; i < 3000000000; i++) {
__vec_sum_4d_d64(dvec1, dvec2, dvec2);
}
#endif
printf("%10.10lf %10.10lf %10.10lf %10.10lf\n", dvec2[0], dvec2[1], dvec2[2], dvec2[3]);
}
I simply switch #if 1 to #if 0 and the other way around to switch between "modes" (AVX and non-AVX).
My expectation would be that the loop using AVX would be at least somewhat faster than the other one, but it isn't. I compiled the code with gcc version 10.2.0 (GCC) and these flags: -O2 --std=gnu99 -lm -mavx2.
> time ./noavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.150s
user 0m2.147s
sys 0m0.000s
> time ./withavx.x86_64 1 2 3 4 5 6 7 8
3000000005.0000000000 6000000006.0000000000 9000000007.0000000000 12000000008.0000000000
real 0m2.168s
user 0m2.165s
sys 0m0.000s
As you can see, they run at practically the same speed. I also tried to increase the number of iterations by a factor of ten, but the results simply scale up proportionally. Also note that the printed output values are the same for both executables, so I think that it is safe to say that both perform the same calculations. Digging deeper, I took a look at the assembly and was even more confused. Here are the important parts of both (only the loop):
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
In my understanding the second one should be way slower since besides decrementing the counter and the conditional jump there are four times as many instructions in it. Why is it not slower? Is the vaddsd instruction just four times faster than vaddpd?
If this is relevant, my system runs on a AMD Ryzen 5 2600X Six-Core Processor which supports AVX.
With AVX
; With avx
1070: c5 fd 58 c1 vaddpd %ymm1,%ymm0,%ymm0
1074: 48 83 e8 01 sub $0x1,%rax
1078: 75 f6 jne 1070
This loop is using ymm0 as an accumulator. In other words it is doing ymm0 += ymm1 (this is a vector operation; adding 4 double values at once). Therefore it has a loop-carried dependency on ymm0 (every new addition has to wait for the previous addition to finish and uses the result to start the next addition). vaddpd has latency=3, throughput=1 for Zen+ (according to https://www.uops.info/table.html). The loop-carried dependency makes this loop bottleneck on the latency of vaddpd, so your loop can get at best 3 cycles/iteration. Only one vaddpd addition is in flight in the CPU, which is under-utilizing its capability by a lot.
To make this faster add more accumulators (have more vectors to sum). It can (in theory) get 3 times faster due to pipelining (3 full ymm additions in-flight), as long as it does not get limited by something else.
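Here is a sketch (mine, not code from the question) of the multiple-accumulator idea, applied to summing an array of doubles; the function name and setup are assumptions. Each accumulator carries its own dependency chain, so several vaddpd operations can be in flight at once.
#include <stddef.h>
#include <immintrin.h>

__m256d sum_multi_acc(const double *data, size_t n) {   /* n must be a multiple of 16 */
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    __m256d acc3 = _mm256_setzero_pd();
    for (size_t i = 0; i < n; i += 16) {                 /* 4 independent vector adds per iteration */
        acc0 = _mm256_add_pd(acc0, _mm256_loadu_pd(data + i));
        acc1 = _mm256_add_pd(acc1, _mm256_loadu_pd(data + i + 4));
        acc2 = _mm256_add_pd(acc2, _mm256_loadu_pd(data + i + 8));
        acc3 = _mm256_add_pd(acc3, _mm256_loadu_pd(data + i + 12));
    }
    /* combine the independent partial sums only once, after the loop */
    return _mm256_add_pd(_mm256_add_pd(acc0, acc1), _mm256_add_pd(acc2, acc3));
}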
Without AVX
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
This loop accumulates results into 4 different accumulators. Basically it is doing:
xmm0 += xmm4
xmm1 += xmm5
xmm2 += xmm7
xmm3 += xmm6
All of these additions are independent from each other (and they are scalar additions, so each only operates on a single 64-bit floating point value). vaddsd has latency=3, throughput=0.5 (cycles per instruction), which means that it can start executing the first 2 additions in one cycle. Then on the next cycle it will start the second pair of additions. Therefore it is possible to achieve 2 cycles/iteration for this loop based on throughput. But latency, as you recall, is 3 cycles. So this loop is also bottlenecked on latency. Unroll once (with 4 additional accumulators; alternatively, break the loop-carried dependency chain within the loop by adding xmm4-7 to each other before adding them to the main accumulators) to get rid of that bottleneck (it may get ~50% faster).
Note that this ("without AVX") disassembly is still using VEX encoding, so technically still requires AVX-capable CPU.
On Benchmarking
Note that your disassembly does not have any loads or stores, so this may or may not be representative of performance comparison for adding 2 arrays of 4-double vectors.
You are dealing with a latency issue. Depending on the CPU you have to wait 3 or 4 cycles until you can use the result of a vaddpd or vaddsd instruction. But within 1 cycle up to 2 vaddpd or vaddsd instructions can be executed (if the CPU does not have to wait for source registers).
Since in your loop
; Without avx
1080: c5 fb 58 c4 vaddsd %xmm4,%xmm0,%xmm0
1084: c5 f3 58 cd vaddsd %xmm5,%xmm1,%xmm1
1088: c5 eb 58 d7 vaddsd %xmm7,%xmm2,%xmm2
108c: c5 e3 58 de vaddsd %xmm6,%xmm3,%xmm3
1090: 48 83 e8 01 sub $0x1,%rax
1094: 75 ea jne 1080
each vaddsd depends on the result from the previous iteration, it has to wait 3 or 4 cycles before it can be executed. But the execution of all the vaddsd, the sub and the jne can happen during that time. Therefore, for this simple loop, it does not make a difference whether you execute one vaddpd or four vaddsd.
To fully utilize the vaddpd throughput, you need to execute 6 or 8 of them that do not depend on each other's results (or have other instructions that do some independent work).
EEPROM Data:
0000: 88 77 66 55 44 33 22 11 00 00 00 00 00 00 00 00
0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
I am saving the result after reading 0th row of EEPROM in array
Ex - Uint8 EEPROM_res[8];
EEPROM_res = {88, 77, 66, 55, 44, 33, 22, 11};
I want to convert HexaDecimal(0x8877665544332211) into decimal (9833440827789222417) and save the decimal value into integer data type for further comparison. What is the easiest way of conversion of 8-Byte Hexadecimal?
Can you share the algorithm? – Shivangi Kishore
Converting base 10 (seconds) to base 60 (hours:minutes:seconds)
4321 seconds (in base 10) to base 60.
60^0 = 1
60^1 = 60
60^2 = 3600
60^3 = 216000
(just like 10^0 = 1, 10^1 = 10 and 10^2 = 100 ... base 10, 2^0 = 1, 2^1 = 2, 2^2 = 4 and so on base 2)
So 4321 is less than 216000 but greater than 3600 so we can shortcut and start there
4321 / 3600 = 1 remainder 721
721 / 60 = 12 remainder 1
So 4321 base 10 converted to base 60 (using base 10 to do the math) is 01:12:01
base 2 to base 10 using a base 2 computer is no different.
10 factors into 2 and 5, while 2 factors into just 2, so you cannot do the kind of shortcuts between base 10 and base 2 that you can do between base 8 (octal) and base 2 or between base 16 (hex) and base 2. You have to do it the long way.
EDIT
Another approach that may be more useful to you is to work from the other end. Same math just done using remainders instead of results. Makes for an easier algorithm to program.
4321 / 60 = 72 remainder 1
72 / 60 = 1 remainder 12
1 / 60 = 0 remainder 1
conversion to base 60: 01:12:01
1234 / 10 = 123 remainder 4
123 / 10 = 12 remainder 3
12 / 10 = 1 remainder 2
1 / 10 = 0 remainder 1
conversion to base 10: 1234
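In C, that remainder method looks something like this minimal sketch, which prints the decimal digits of a 64-bit value (it covers the 0x8877665544332211 example, since that fits in 64 bits):
#include <stdint.h>
#include <stdio.h>

void print_decimal(uint64_t v) {
    char digits[20];
    int n = 0;
    do {
        digits[n++] = (char)('0' + v % 10);   /* peel off the least significant digit */
        v /= 10;
    } while (v != 0);
    while (n--) putchar(digits[n]);           /* emit most significant digit first */
    putchar('\n');
}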
Long division in binary is the same but simpler than in a base greater than 2 because the divisor on each step through the denominator can either go into the test value 0 times or 1 time. binary...base 2...
Also if you think through long division (254 / 5 or 0xFE / 0x5)
------------
101 ) 11111110
this is the first test case that is non-zero
001
------------
101 ) 11111110
101
and you keep going
001
------------
101 ) 11111110
101
---
10
and
0011
------------
101 ) 11111110
101
---
101
101
---
0
and
00110010
------------
101 ) 11111110
101
---
101
101
---
0111
101
---
100
and so 0xFE / 5 = 0x32 remainder 4, but the key here is that I could
do that in hardware with a hardware divide instruction if I have say an 8 bit divide instruction and want to divide an infinitely long number.
If my next (let's say) four digits were 1010:
0001110
101 1001010
101
===
1000
101
===
111
101
===
100
0xFEA / 5 = 0x32E remainder 4
So now I have divided a 12 bit number using an 8 bit divider instruction and I can do this all day long until I run out of ram. 8 bits, 88 bits, 888 bits, 8888 bits, a million bits divided by a small number like 5 or 10.
Or if you keep working on this, you find that compilers often use a multiply instead, based on something we also know from grade school (since all of this problem is solved with grade school math):
x / 10 = x * (1/10)
More likely to find a hardware multiply than a divide and the multiply is often fewer clocks, etc.
unsigned int fun ( unsigned int x )
{
return(x/10);
}
00000000 <fun>:
0: e59f3008 ldr r3, [pc, #8] ; 10 <fun+0x10>
4: e0802093 umull r2, r0, r3, r0
8: e1a001a0 lsr r0, r0, #3
c: e12fff1e bx lr
10: cccccccd stclgt 12, cr12, [r12], {205} ; 0xcd
0000000000000000 <fun>:
0: 89 f8 mov %edi,%eax
2: ba cd cc cc cc mov $0xcccccccd,%edx
7: f7 e2 mul %edx
9: 89 d0 mov %edx,%eax
b: c1 e8 03 shr $0x3,%eax
e: c3 retq
On this and other instruction sets, the compiler multiplies by a fixed-point reciprocal (effectively 1/5) and then compensates with a shift (base 10 factors into 2 and 5; base 2 factors into 2, the common factor).
But if your hardware doesn't have a multiply or divide the compiler should still handle the basic C language variable types, long, int, short, char. And you can cascade those all day long.
unsigned int fun ( unsigned int x )
{
  unsigned int ra;
  unsigned int rb;
  unsigned int rc;
  ra=((x>>4)&0xFF)/5;
  rb=((x>>4)&0xFF)%5;
  rb=(rb<<4)|(x&0xF);
  rc=rb/5;
  ra=(ra<<4)|rc;
  return(ra);
}
test it on the development machine
#include <stdio.h>
extern unsigned int fun ( unsigned int );
int main ( void )
{
printf("%X\n",fun(0xFEA));
return(0);
}
and the output is 0x32E.
And that really completes it everything you need to know (well you already knew from grade school) to do the conversion with the tools you have available.
If instead you are looking for some big math library for some compiler for some target, having us google things for you is not a Stack Overflow question and should be closed as seeking external or third party libraries.
Now as pointed out
save the decimal value into integer data type for further comparison
makes no sense whatsoever. If you want to take some number and then save it for further comparison on a computer, that function looks like this:
void fun ( void )
{
}
It is already in the form you want: you want it to be an integer, which means some variable (larger than C supports, so that is yet another problem with the wording of the question), which means binary, not decimal. So it is already in a future-comparable integer form.
If you want to represent that number visually (as in a human viewable printout) in some base then you need to convert that into something that can be viewed be it base 2 (binary), base 8 (octal), base 16 (hex), base 10 (decimal) and so on.
Take the bits 11111111 in the computer. If I want to see those in binary, that is
"11111111"; in octal, "377"; in hex, "FF"; in decimal, "255" - all of which require an algorithm to convert. Octal and hex are of course the simplest: you don't need a division routine to convert to octal, base 8, because base 8 factors into 2*2*2 and base 2 factors into 2, so it's 2^3 vs 2^1.
11111111 / 8 = 11111111 >> 3 = 11111 r 111
11111 / 8 = 11111 >> 3 = 11 r 111
11 / 8 = 11 >> 3 = 0 r 11
377
Base 10 though you have to go the long way and actually do the division and find the remainder until the result of the division in the loop is 0.
10 has factors 2 and 5, 2 has factors 2; you can't shift your way through it. Base 100 (10*10) to base 10 you can shift your way through (just like base 4 to base 2), but base 10 from base 2 you can't.
11111111 / 10 = 11001 r 101
11001 / 10 = 10 r 101
10 / 10 = 0 r 10
255
Which of course is why we greatly prefer to view stuff on the computer in hex rather than decimal.
Once in decimal though
"for further comparison"
once you get it to base 10, the only reasonable comparison you can do with other base 10 numbers is a string compare or an array compare. From the above example, the two more common ways you would store that conversion are 0x32, 0x35, 0x35, 0x00 or 0x02, 0x05, 0x05 with some length knowledge.
You can't do greater-than or less-than without a whole lot of work. Equal vs not equal you could do in base 10, but it is not in integer form.
So your question doesn't make any sense.
Also, I assume this is a multi-part typo:
EEPROM_res = {88, 77, 66, 55, 44, 33, 22, 11};
which is the same as
EEPROM_res = {0x58,0x4D,0x42,0x37,0x2C,0x21,0x16,0x0B};
Neither of which are
EEPROM_res = {0x88,0x77,0x66,0x55,0x44,0x33,0x22,0x11};
Which is what your first 8 bytes of eeprom dump showed in hexadecimal as you mentioned and is somewhat obvious.
Nor are they
EEPROM_res[19] = {0x39,0x38,0x33,0x33....and so on
or
EEPROM_res[19] = {0x09,0x08,0x03,0x03....and so on
the decimal value you computed somehow: 9833440827789222417
An online delay loop generator gives me this delay loop with a runtime of 0.5 s for a chip running at 16 MHz.
The questions on my mind are:
Do the branches keep branching if the register becomes negative?
How exactly does one calculate the values that are loaded in the beginning?
ldi r18, 41
ldi r19, 150
ldi r20, 128
L1: dec r20
brne L1
dec r19
brne L1
dec r18
brne L1
To answer your questions exactly:
1: The DEC instruction doesn't know about 'signed' numbers, it just decrements an 8-bit register. The miracle of twos complement arithmetic makes this work at the wraparound (0x00 -> 0xFF, is the same bit pattern as 0 -> -1). The DEC instruction also sets the Z flag in the status register, which BRNE uses to determine if branching should happen.
2: You can see from the AVR manual that DEC is a single cycle instruction. BRNE is also a single cycle when not branching, and 2 cycles when branching. therefore to compute the time of your loop, you need to count the number of times each path will be taken.
Consider a single DEC/BRNE loop:
ldi r8 0
L1: dec r8
brne L1
This loop will execute exactly 256 times, which is 256 cycles of DEC and 511 cycles of BRNE (the final BRNE falls through in a single cycle), for a total of 767 cycles. At 16MHz, that's roughly 48us.
Wrapping that in an outer delay loop:
ldi r7 10
ldi r8 0
L1: dec r8
brne L1
dec r7
brne L1
You can see that the outer loop counter will decrement every time the inner loop counter hits 0. Thus in our example the outer DEC/BRNE pair will execute 10 times (about 30 cycles in total), and the inner loop will run 10 x 256 times, so the total time for this loop is roughly 10 x 48us, i.e. about 480us. Similarly for 3 nested loops.
From here, it's trivial to figure out how many times each loop should execute to achieve the desired delay: take the largest number of outer-loop iterations that stays under the desired time, subtract that time, do the same for the next nested loop, and so on until the innermost loop fills up the tiny amount left.
How exactly does one calculate the values that are loaded in the beginning?
Calculate total amount of cycles => 0.5s * 16000000 = 8000000
Know the total cycles of the r20 and r19 loops (from zero to zero). AVR registers are 8-bit, so a full loop is 256 iterations (dec 0 = 255). dec is 1 cycle; brne is 2 cycles when the branch is taken, 1 cycle when not.
So the most inner loop:
L1: dec r20
brne L1
Is from zero to zero (r20=0): 255 * (1+2) + 1 * (1+1) = 767 cycles (255 times the branch is taken, 1 time it goes through).
The second wrapping loop working with r19 is then: 255 * (767+1+2) + 1 * (767+1+1) = 197119 cycles
The single r18 loop when branch is taken is then 197119+1+2 = 197122 cycles. (197121 when branch is not taken = final exit of delay loop, I will avoid this -1 by a trick in next step).
Now this is almost enough to calculate initial r18, let's adjust the total cycles first by the O(1) code, that's three times ldi instruction, which takes 1 cycle: total2 = 8000000 - (1+1+1) + 1 = 7999998 ... wait, what is the last +1 there? That's fake additional cycle to delay, to make the final r18 loop pretend it costs same as non-final, i.e. 197122 cycles.
And that's it, the initial r18 must be enough to wait at least 7999998 cycles: r18 = (7999998 + 197122 - 1) div 197122 = 41. The " + 197122 - 1" part will make sure the abundant cycles fits constraint: 0 <= abundant_cycles < 197122 (remainder by 197122 division).
41 * 197122 = 8082002 ... this is too much, but now we can shave the extra cycles down by setting up r19 and r20 to particular values as well, to fine-tune the delay. So how much is to be shaved off? 8082002 - 7999998 = 82004 cycles.
The single r19 loop takes 770 cycles when branching and 769 when exiting, so again let's avoid the 769 by adjusting 82004 to only 82003 to be shaved off. 82003 div 770 = 106: 106 r19 loops can be skipped, r19 = 256 - 106 = 150. Now this will shave 81620 cycles, so 82003 - 81620 = 383 cycles more to be shaved off.
The single r20 loop takes 3 cycles when branching and 2 when exiting. Again I will take into account that the exiting loop is only 2 cycles -> 383 => 382 to shave off. And 382 div 3 = 127, remainder 1. r20 = 256 - 127 = 129, and do one less to shave an additional 3 cycles (to cover that remainder) = 128. Then a 2-cycle (3-1) wait is missing to make it a full 8 million.
So:
ldi r18, 41
ldi r19, 150
ldi r20, 128
L1: dec r20
brne L1
dec r19
brne L1
dec r18
brne L1
According to my calculations should wait exactly 8000000-2 cycles (if not interrupted by something else).
Let's try to verify:
Initial r20: 127*3 + 1*2 = 383 cycles
Initial r19: 1*(383+1+2) + 148*(767+1+2) + 1*(767+1+1) = 115115 cycles
(that's the initial incomplete r20 pass one time, then 149 full r20 passes, with the final one being 1 cycle shorter due to the exiting brne)
The r18 total: 1*(115115+1+2) + 39*(197119+1+2) + 1*(197119+1+1) = 7999997 cycles.
And the three ldi are +3 cycles = 7999997+3 = 8000000.
And the missing 2 cycles are nowhere to be seen, so I made a mistake somewhere.
As you can see, the math behind is reasonably simple, but very mundane to do by hand, and prone to mistakes...
Ah, I think I know where I made the mistake. When I'm shaving off the abundant cycles, the termination loop is not involved (that's part of the actual delay), so I shouldn't have adjusted the to_shave_off cycles by -1. Then, after r19 = 106, I would still have 384 cycles to shave off, and that's exactly 384/3 = 128 loops to shave off from r20: 256 - 128 = 128. No remainder, no missing cycle, a perfect 8 million.
If you have trouble following this reverse calculation, try it the other way: imagine 2-bit registers (values 0..3 only), do a similar loop on paper with r18=r19=r20=2, and count the cycles manually to see how it evolves, i.e. 3x ldi = +3, dec r20, brne, dec r20, brne (fall through) = +5 cycles, dec r19, brne = +3, ... etc.
Edit: this was explained before by Jester in his links. And I'm too lazy to clean this up into some simple formula so you can create your own online calculator.
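That said, the reverse calculation is mechanical enough to sketch in C. The following is my own rough sketch (not part of the original answer), using the cycle model above (1-cycle ldi/dec, 2/1-cycle brne, a loaded value of 0 meaning 256 iterations, and the corrected shave step); it assumes the computed r18 ends up >= 2.
#include <stdint.h>
#include <stdio.h>
#define INNER_FULL  767UL            /* full r20 loop, zero to zero            */
#define MID_FULL    197119UL         /* full r19 loop: 255*770 + 769           */
#define OUTER_ITER  (MID_FULL + 3)   /* one r18 iteration with branch taken    */
/* Forward model: exact cycle count of the 3-level loop, for verification.
   Assumes r18 >= 2; r19/r20 values of 0 mean 256 iterations.                  */
static unsigned long delay_cycles(unsigned r18, unsigned r19, unsigned r20)
{
    unsigned long n20 = r20 ? r20 : 256, n19 = r19 ? r19 : 256;
    unsigned long first_inner = (n20 - 1) * 3 + 2;                 /* partial r20 pass */
    unsigned long first_mid   = first_inner + 3
                              + (n19 - 1) * (INNER_FULL + 3) - 1;  /* partial r19 pass */
    return 3                                    /* three ldi                   */
         + first_mid + 3                        /* first r18 iteration         */
         + (r18 - 2) * OUTER_ITER               /* middle r18 iterations       */
         + (OUTER_ITER - 1);                    /* final iteration, falls out  */
}
int main(void)
{
    const unsigned long target = 8000000UL;           /* 0.5 s at 16 MHz             */
    unsigned long t = target - 3 + 1;                 /* drop ldi, fake final iteration */
    unsigned r18 = (t + OUTER_ITER - 1) / OUTER_ITER; /* ceiling division -> 41       */
    unsigned long shave = r18 * OUTER_ITER - t;       /* abundant cycles   -> 82004   */
    unsigned skip19 = shave / (INNER_FULL + 3);       /* r19 loops to skip -> 106     */
    unsigned r19 = (256 - skip19) & 0xFF;             /* -> 150                       */
    shave -= skip19 * (INNER_FULL + 3);               /* -> 384                       */
    unsigned skip20 = shave / 3;                      /* r20 loops to skip -> 128     */
    if (skip20 > 255) skip20 = 255;                   /* can't skip the whole first pass */
    unsigned r20 = (256 - skip20) & 0xFF;             /* -> 128                       */
    unsigned long got = delay_cycles(r18, r19, r20);
    printf("r18=%u r19=%u r20=%u -> %lu cycles (target %lu)\n",
           r18, r19, r20, got, target);
    return 0;
}
For the 0.5 s / 16 MHz example this should print r18=41, r19=150, r20=128 and 8000000 cycles, matching the corrected calculation above; targets that don't divide evenly can still end up a cycle or two off, just as described.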
I'm a newbie at instruction optimization.
I did a simple analysis on a simple function dotp which is used to get the dot product of two float arrays.
The C code is as follows:
float dotp(
const float x[],
const float y[],
const short n
)
{
short i;
float suma;
suma = 0.0f;
for(i=0; i<n; i++)
{
suma += x[i] * y[i];
}
return suma;
}
I use the test framework testp provided by Agner Fog on the web.
The arrays which are used in this case are aligned:
int n = 2048;
float* z2 = (float*)_mm_malloc(sizeof(float)*n, 64);
char *mem = (char*)_mm_malloc(1<<18,4096);
char *a = mem;
char *b = a+n*sizeof(float);
char *c = b+n*sizeof(float);
float *x = (float*)a;
float *y = (float*)b;
float *z = (float*)c;
Then I call the function dotp, n=2048, repeat=100000:
for (i = 0; i < repeat; i++)
{
sum = dotp(x,y,n);
}
I compile it with gcc 4.8.3, with the compile option -O3.
I compile this application on a computer which does not support FMA instructions, so you can see there are only SSE instructions.
The assembly code:
.L13:
movss xmm1, DWORD PTR [rdi+rax*4]
mulss xmm1, DWORD PTR [rsi+rax*4]
add rax, 1
cmp cx, ax
addss xmm0, xmm1
jg .L13
I do some analysis:
      | uops-fused | latency | p0   | p1   | p2  | p3  | p4 | p5   | p6   | p7
-------------------------------------------------------------------------------
movss |     1      |    3    |      |      | 0.5 | 0.5 |    |      |      |
mulss |     1      |    5    | 0.5  | 0.5  | 0.5 | 0.5 |    |      |      |
add   |     1      |    1    | 0.25 | 0.25 |     |     |    | 0.25 | 0.25 |
cmp   |     1      |    1    | 0.25 | 0.25 |     |     |    | 0.25 | 0.25 |
addss |     1      |    3    |      | 1    |     |     |    |      |      |
jg    |     1      |    1    |      |      |     |     |    |      | 1    |
-------------------------------------------------------------------------------
total |     6      |    5    | 1    | 2    | 1   | 1   |    | 0.5  | 1.5  |
After running, we get the result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
--------------------------------------------------------------------
542177906 |609942404 |1230100389 |205000027 |261069369 |205511063
--------------------------------------------------------------------
2.64 | 2.97 | 6.00 | 1 | 1.27 | 1.00
uop p2 | uop p3 | uop p4 | uop p5 | uop p6 | uop p7
-----------------------------------------------------------------------
205185258 | 205188997 | 100833 | 245370353 | 313581694 | 844
-----------------------------------------------------------------------
1.00 | 1.00 | 0.00 | 1.19 | 1.52 | 0.00
The second line is the values read from the Intel performance-counter registers; the third line is those values divided by the branch count, "BrTaken".
So we can see, in the loop there are 6 instructions, 7 uops, in agreement with the analysis.
The numbers of uops run on port0, port1, port5 and port6 are similar to what the analysis says. I think maybe the uop scheduler does this; it may try to balance the load across the ports, am I right?
I absolutely do not understand why there are only about 3 cycles per loop. According to Agner's instruction table, the latency of the mulss instruction is 5, and there are dependencies between loop iterations, so as far as I can see it should take at least 5 cycles per loop.
Could anyone shed some light on this?
==================================================================
I tried to write an optimized version of this function in nasm, unrolling the loop by a factor of 8 and using the vfmadd231ps instruction:
.L2:
vmovaps ymm1, [rdi+rax]
vfmadd231ps ymm0, ymm1, [rsi+rax]
vmovaps ymm2, [rdi+rax+32]
vfmadd231ps ymm3, ymm2, [rsi+rax+32]
vmovaps ymm4, [rdi+rax+64]
vfmadd231ps ymm5, ymm4, [rsi+rax+64]
vmovaps ymm6, [rdi+rax+96]
vfmadd231ps ymm7, ymm6, [rsi+rax+96]
vmovaps ymm8, [rdi+rax+128]
vfmadd231ps ymm9, ymm8, [rsi+rax+128]
vmovaps ymm10, [rdi+rax+160]
vfmadd231ps ymm11, ymm10, [rsi+rax+160]
vmovaps ymm12, [rdi+rax+192]
vfmadd231ps ymm13, ymm12, [rsi+rax+192]
vmovaps ymm14, [rdi+rax+224]
vfmadd231ps ymm15, ymm14, [rsi+rax+224]
add rax, 256
jne .L2
The result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
------------------------------------------------------------------------
24371315 | 27477805| 59400061 | 3200001 | 14679543 | 11011601
------------------------------------------------------------------------
7.62 | 8.59 | 18.56 | 1 | 4.59 | 3.44
uop p2 | uop p3 | uop p4 | uop p5 | uop p6 | uop p7
-------------------------------------------------------------------------
25960380 |26000252 | 47 | 537 | 3301043 | 10
------------------------------------------------------------------------------
8.11 |8.13 | 0.00 | 0.00 | 1.03 | 0.00
So we can see the L1 data cache bandwidth reaches 2*256 bit / 8.59 cycles, which is very near the peak of 2*256 bit / 8 cycles: the usage is about 93%. The FMA units only execute 8 FMAs per 8.59 cycles, against a peak of 2*8 per 8 cycles, so the usage is 47%.
So I think I've reached the L1D bottleneck as Peter Cordes expects.
==================================================================
Special thanks to Boann, who fixed so many grammatical errors in my question.
=================================================================
From Peter's reply, I understand that only registers that are "read and written" form a dependency; "write-only" registers do not.
So I try to reduce the number of registers used in the loop, and I try unrolling by 5; if everything is OK, I should meet the same bottleneck, L1D.
.L2:
vmovaps ymm0, [rdi+rax]
vfmadd231ps ymm1, ymm0, [rsi+rax]
vmovaps ymm0, [rdi+rax+32]
vfmadd231ps ymm2, ymm0, [rsi+rax+32]
vmovaps ymm0, [rdi+rax+64]
vfmadd231ps ymm3, ymm0, [rsi+rax+64]
vmovaps ymm0, [rdi+rax+96]
vfmadd231ps ymm4, ymm0, [rsi+rax+96]
vmovaps ymm0, [rdi+rax+128]
vfmadd231ps ymm5, ymm0, [rsi+rax+128]
add rax, 160 ; 5 vectors * 32 bytes = 160 bytes (40 floats)
jne .L2
The result:
Clock | Core cyc | Instruct | BrTaken | uop p0 | uop p1
------------------------------------------------------------------------
25332590 | 28547345 | 63700051 | 5100001 | 14951738 | 10549694
------------------------------------------------------------------------
4.97 | 5.60 | 12.49 | 1 | 2.93 | 2.07
uop p2 |uop p3 | uop p4 | uop p5 |uop p6 | uop p7
------------------------------------------------------------------------------
25900132 |25900132 | 50 | 683 | 5400909 | 9
-------------------------------------------------------------------------------
5.08 |5.08 | 0.00 | 0.00 |1.06 | 0.00
We can see 5/5.60 = 89.45%, which is a little smaller than unrolling by 8. Is there something wrong?
=================================================================
I try to unroll the loop by 6, 7 and 15, to see the result.
I also unroll by 5 and 8 again, to double-check the result.
The results are as follows; we can see that this time the results are much better than before.
Although the results are not stable, the trend is clear: the bigger the unrolling factor, the better the result.
| L1D bandwidth | CodeMiss | L1D Miss | L2 Miss
----------------------------------------------------------------------------
unroll5 | 91.86% ~ 91.94% | 3~33 | 272~888 | 17~223
--------------------------------------------------------------------------
unroll6 | 92.93% ~ 93.00% | 4~30 | 481~1432 | 26~213
--------------------------------------------------------------------------
unroll7 | 92.29% ~ 92.65% | 5~28 | 336~1736 | 14~257
--------------------------------------------------------------------------
unroll8 | 95.10% ~ 97.68% | 4~23 | 363~780 | 42~132
--------------------------------------------------------------------------
unroll15 | 97.95% ~ 98.16% | 5~28 | 651~1295 | 29~68
=====================================================================
I tried compiling the function with gcc 7.1 on https://gcc.godbolt.org.
The compile options are "-O3 -march=haswell -mtune=intel"; the result is similar to gcc 4.8.3's:
.L3:
vmovss xmm1, DWORD PTR [rdi+rax]
vfmadd231ss xmm0, xmm1, DWORD PTR [rsi+rax]
add rax, 4
cmp rdx, rax
jne .L3
ret
Related:
AVX2: Computing dot product of 512 float arrays has a good manually-vectorized dot-product loop using multiple accumulators with FMA intrinsics. The rest of the answer explains why that's a good thing, with cpu-architecture / asm details.
Dot Product of Vectors with SIMD shows that with the right compiler options, some compilers will auto-vectorize that way.
Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell another version of this Q&A with more focus on unrolling to hide latency (and bottleneck on throughput), less background on what that even means. And with examples using C intrinsics.
Latency bounds and throughput bounds for processors for operations that must occur in sequence - a textbook exercise on dependency chains, with two interlocking chains, one reading from earlier in the other.
Look at your loop again: movss xmm1, src has no dependency on the old value of xmm1, because its destination is write-only. Each iteration's mulss is independent. Out-of-order execution can and does exploit that instruction-level parallelism, so you definitely don't bottleneck on mulss latency.
Optional reading: In computer architecture terms: register renaming avoids the WAR anti-dependency data hazard of reusing the same architectural register. (Some pipelining + dependency-tracking schemes before register renaming didn't solve all the problems, so the field of computer architecture makes a big deal out of different kinds of data hazards.)
Register renaming with Tomasulo's algorithm makes everything go away except the actual true dependencies (read after write), so any instruction where the destination is not also a source register has no interaction with the dependency chain involving the old value of that register. (Except for false dependencies, like popcnt on Intel CPUs, and writing only part of a register without clearing the rest (like mov al, 5 or sqrtss xmm2, xmm1). Related: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?).
Back to your code:
.L13:
movss xmm1, DWORD PTR [rdi+rax*4]
mulss xmm1, DWORD PTR [rsi+rax*4]
add rax, 1
cmp cx, ax
addss xmm0, xmm1
jg .L13
The loop-carried dependencies (from one iteration to the next) are each:
xmm0, read and written by addss xmm0, xmm1, which has 3 cycle latency on Haswell.
rax, read and written by add rax, 1. 1c latency, so it's not the critical-path.
It looks like you measured the execution time / cycle-count correctly, because the loop bottlenecks on the 3c addss latency.
This is expected: the serial dependency in a dot product is the addition into a single sum (aka the reduction), not the multiplies between vector elements. (Unrolling with multiple sum accumulator variables / registers can hide that latency.)
That is by far the dominant bottleneck for this loop, despite various minor inefficiencies:
short i produced the silly cmp cx, ax, which takes an extra operand-size prefix. Luckily, gcc managed to avoid actually doing add ax, 1, because signed-overflow is Undefined Behaviour in C. So the optimizer can assume it doesn't happen. (update: integer promotion rules make it different for short, so UB doesn't come into it, but gcc can still legally optimize. Pretty wacky stuff.)
If you'd compiled with -mtune=intel, or better, -march=haswell, gcc would have put the cmp and jg next to each other where they could macro-fuse.
I'm not sure why you have a * in your table on the cmp and add instructions. (update: I was purely guessing that you were using a notation like IACA does, but apparently you weren't). Neither of them fuse. The only fusion happening is micro-fusion of mulss xmm1, [rsi+rax*4].
And since it's a 2-operand ALU instruction with a read-modify-write destination register, it stays micro-fused even in the ROB on Haswell. (Sandybridge would un-laminate it at issue time.) Note that vmulss xmm1, xmm1, [rsi+rax*4] would un-laminate on Haswell, too.
None of this really matters, since you just totally bottleneck on FP-add latency, much slower than any uop-throughput limits. Without -ffast-math, there's nothing compilers can do. With -ffast-math, clang will usually unroll with multiple accumulators, and it will auto-vectorize so they will be vector accumulators. So you can probably saturate Haswell's throughput limit of 1 vector or scalar FP add per clock, if you hit in L1D cache.
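Just for illustration (this is my own sketch, not from the original answer), "multiple accumulators" in plain C looks something like the following. Because it changes the order of the FP additions, a compiler will only make this transformation for you under -ffast-math:
/* Four independent partial sums break the single addss dependency chain,
   so several adds can be in flight at once. Assumes n is a multiple of 4;
   a real version needs cleanup code for the remainder. */
float dotp_unrolled(const float x[], const float y[], int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i+0] * y[i+0];
        s1 += x[i+1] * y[i+1];
        s2 += x[i+2] * y[i+2];
        s3 += x[i+3] * y[i+3];
    }
    return (s0 + s1) + (s2 + s3);   /* reduce the accumulators at the end */
}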
With FMA being 5c latency and 0.5c throughput on Haswell, you would need 10 accumulators to keep 10 FMAs in flight and max out FMA throughput by keeping p0/p1 saturated with FMAs. (Skylake reduced FMA latency to 4 cycles, and runs multiply, add, and FMA on the FMA units. So it actually has higher add latency than Haswell.)
(You're bottlenecked on loads, because you need two loads for every FMA. In other cases, you can actually gain add throughput by replacing some vaddps instructions with FMAs using a multiplier of 1.0. This means more latency to hide, so it's best in a more complex algorithm where you have an add that's not on the critical path in the first place.)
Re: uops per port:
there are 1.19 uops per loop on port 5, which is much more than the expected 0.5; is it a matter of the uop dispatcher trying to put the same number of uops on every port?
Yes, something like that.
The uops are not assigned randomly, or somehow evenly distributed across every port they could run on. You assumed that the add and cmp uops would distribute evenly across p0156, but that's not the case.
The issue stage assigns uops to ports based on how many uops are already waiting for that port. Since addss can only run on p1 (and it's the loop bottleneck), there are usually a lot of p1 uops issued but not executed. So few other uops will ever be scheduled to port1. (This includes mulss: most of the mulss uops will end up scheduled to port 0.)
Taken-branches can only run on port 6. Port 5 doesn't have any uops in this loop that can only run there, so it ends up attracting a lot of the many-port uops.
The scheduler (which picks unfused-domain uops out of the Reservation Station) isn't smart enough to run critical-path-first, so this issue-time assignment algorithm reduces resource-conflict latency (other uops stealing port1 on cycles when an addss could have run). It's also useful in cases where you bottleneck on the throughput of a given port.
Scheduling of already-assigned uops is normally oldest-ready first, as I understand it. This simple algorithm is hardly surprising, since it has to pick a uop with its inputs ready for each port from a 60-entry RS every clock cycle, without melting your CPU. The out-of-order machinery that finds and exploits the ILP is one of the significant power costs in a modern CPU, comparable to the execution units that do the actual work.
Related / more details: How are x86 uops scheduled, exactly?
More performance analysis stuff:
Other than cache misses / branch mispredicts, the three main possible bottlenecks for CPU-bound loops are:
dependency chains (like in this case)
front-end throughput (max of 4 fused-domain uops issued per clock on Haswell)
execution port bottlenecks, like if lots of uops need p0/p1, or p2/p3, like in your unrolled loop. Count unfused-domain uops for specific ports. Generally you can assume best-case distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes.
A loop body or short block of code can be approximately characterized by 3 things: fused-domain uop count, unfused-domain count of which execution units it can run on, and total critical-path latency assuming best-case scheduling for its critical path. (Or latencies from each of input A/B/C to the output...)
For example of doing all three to compare a few short sequences, see my answer on What is the efficient way to count set bits at a position or lower?
For short loops, modern CPUs have enough out-of-order execution resources (physical register file size so renaming doesn't run out of registers, ROB size) to have enough iterations of a loop in-flight to find all the parallelism. But as dependency chains within loops get longer, eventually they run out. See Measuring Reorder Buffer Capacity for some details on what happens when a CPU runs out of registers to rename onto.
See also lots of performance and reference links in the x86 tag wiki.
Tuning your FMA loop:
Yes, dot-product on Haswell will bottleneck on L1D throughput at only half the throughput of the FMA units, since it takes two loads per multiply+add.
If you were doing B[i] = x * A[i] + y; or sum(A[i]^2), you could saturate FMA throughput.
It looks like you're still trying to avoid register reuse even in write-only cases like the destination of a vmovaps load, so you ran out of registers after unrolling by 8. That's fine, but could matter for other cases.
Also, using ymm8-15 can slightly increase code-size if it means a 3-byte VEX prefix is needed instead of 2-byte. Fun fact: vpxor ymm7,ymm7,ymm8 needs a 3-byte VEX while vpxor ymm8,ymm8,ymm7 only needs a 2-byte VEX prefix. For commutative ops, sort source regs from high to low.
Our load bottleneck means the best-case FMA throughput is half the max, so we need at least 5 vector accumulators to hide their latency. 8 is good, so there's plenty of slack in the dependency chains to let them catch up after any delays from unexpected latency or competition for p0/p1. 7 or maybe even 6 would be fine, too: your unroll factor doesn't have to be a power of 2.
Unrolling by exactly 5 would mean that you're also right at the bottleneck for dependency chains. Any time an FMA doesn't run in the exact cycle its input is ready means a lost cycle in that dependency chain. This can happen if a load is slow (e.g. it misses in L1 cache and has to wait for L2), or if loads complete out of order and an FMA from another dependency chain steals the port this FMA was scheduled for. (Remember that scheduling happens at issue time, so the uops sitting in the scheduler are either port0 FMA or port1 FMA, not an FMA that can take whichever port is idle).
If you leave some slack in the dependency chains, out-of-order execution can "catch up" on the FMAs, because they won't be bottlenecked on throughput or latency, just waiting for load results. #Forward found (in an update to the question) that unrolling by 5 reduced performance from 93% of L1D throughput to 89.5% for this loop.
My guess is that unroll by 6 (one more than the minimum to hide the latency) would be ok here, and get about the same performance as unroll by 8. If we were closer to maxing out FMA throughput (rather than just bottlenecked on load throughput), one more than the minimum might not be enough.
update: #Forward's experimental test shows my guess was wrong. There isn't a big difference between unroll5 and unroll6. Also, unroll15 is twice as close as unroll8 to the theoretical max throughput of 2x 256b loads per clock. Measuring with just independent loads in the loop, or with independent loads and register-only FMA, would tell us how much of that is due to interaction with the FMA dependency chain. Even the best case won't get perfect 100% throughput, if only because of measurement errors and disruption due to timer interrupts. (Linux perf measures only user-space cycles unless you run it as root, but time still includes time spent in interrupt handlers. This is why your CPU frequency might be reported as 3.87GHz when run as non-root, but 3.900GHz when run as root and measuring cycles instead of cycles:u.)
We aren't bottlenecked on front-end throughput, but we can reduce the fused-domain uop count by avoiding indexed addressing modes for non-mov instructions. Fewer is better and makes this more hyperthreading-friendly when sharing a core with something other than this.
The simple way is just to do two pointer-increments inside the loop. The complicated way is a neat trick of indexing one array relative to the other:
;; input pointers for x[] and y[] in rdi and rsi
;; size_t n in rdx
;;; zero ymm1..8, or load+vmulps into them
add rdx, rsi ; end_y
; lea rdx, [rdx+rsi-252] to break out of the unrolled loop before going off the end, with odd n
sub rdi, rsi ; index x[] relative to y[], saving one pointer increment
.unroll8:
vmovaps ymm0, [rdi+rsi] ; *px, actually py[xy_offset]
vfmadd231ps ymm1, ymm0, [rsi] ; *py
vmovaps ymm0, [rdi+rsi+32] ; write-only reuse of ymm0
vfmadd231ps ymm2, ymm0, [rsi+32]
vmovaps ymm0, [rdi+rsi+64]
vfmadd231ps ymm3, ymm0, [rsi+64]
vmovaps ymm0, [rdi+rsi+96]
vfmadd231ps ymm4, ymm0, [rsi+96]
add rsi, 256 ; pointer-increment here
; so the following instructions can still use disp8 in their addressing modes: [-128 .. +127] instead of disp32
; smaller code-size helps in the big picture, but not for a micro-benchmark
vmovaps ymm0, [rdi+rsi+128-256] ; be pedantic in the source about compensating for the pointer-increment
vfmadd231ps ymm5, ymm0, [rsi+128-256]
vmovaps ymm0, [rdi+rsi+160-256]
vfmadd231ps ymm6, ymm0, [rsi+160-256]
vmovaps ymm0, [rdi+rsi-64] ; or not
vfmadd231ps ymm7, ymm0, [rsi-64]
vmovaps ymm0, [rdi+rsi-32]
vfmadd231ps ymm8, ymm0, [rsi-32]
cmp rsi, rdx
jb .unroll8 ; } while(py < endy);
Using a non-indexed addressing mode as the memory operand for vfmaddps lets it stay micro-fused in the out-of-order core, instead of being un-laminated at issue. Micro fusion and addressing modes
So my loop is 18 fused-domain uops for 8 vectors. Yours takes 3 fused-domain uops for each vmovaps + vfmaddps pair, instead of 2, because of un-lamination of indexed addressing modes. Both of them still of course have 2 unfused-domain load uops (port2/3) per pair, so that's still the bottleneck.
Fewer fused-domain uops lets out-of-order execution see more iterations ahead, potentially helping it absorb cache misses better. It's a minor thing when we're bottlenecked on an execution unit (load uops in this case) even with no cache misses, though. But with hyperthreading, you only get every other cycle of front-end issue bandwidth unless the other thread is stalled. If it's not competing too much for load and p0/1, fewer fused-domain uops will let this loop run faster while sharing a core. (e.g. maybe the other hyper-thread is running a lot of port5 / port6 and store uops?)
Since un-lamination happens after the uop-cache, your version doesn't take extra space in the uop cache. A disp32 with each uop is ok, and doesn't take extra space. But bulkier code-size means the uop-cache is less likely to pack as efficiently, since you'll hit 32B boundaries before uop cache lines are full more often. (Actually, smaller code doesn't guarantee better either. Smaller instructions could lead to filling a uop cache line and needing one entry in another line before crossing a 32B boundary.) This small loop can run from the loopback buffer (LSD), so fortunately the uop-cache isn't a factor.
Then after the loop: efficient cleanup is the hard part of efficient vectorization for small arrays that might not be a multiple of the unroll factor, or especially of the vector width.
...
jb
;; If `n` might not be a multiple of 4x 8 floats, put cleanup code here
;; to do the last few ymm or xmm vectors, then scalar or an unaligned last vector + mask.
; reduce down to a single vector, with a tree of dependencies
vaddps ymm1, ymm2, ymm1
vaddps ymm3, ymm4, ymm3
vaddps ymm5, ymm6, ymm5
vaddps ymm7, ymm8, ymm7
vaddps ymm0, ymm3, ymm1
vaddps ymm1, ymm7, ymm5
vaddps ymm0, ymm1, ymm0
; horizontal within that vector, low_half += high_half until we're down to 1
vextractf128 xmm1, ymm0, 1
vaddps xmm0, xmm0, xmm1
vmovhlps xmm1, xmm0, xmm0
vaddps xmm0, xmm0, xmm1
vmovshdup xmm1, xmm0
vaddss xmm0, xmm0, xmm1
; this is faster than 2x vhaddps
vzeroupper ; important if returning to non-AVX-aware code after using ymm regs.
ret ; with the scalar result in xmm0
For more about the horizontal sum at the end, see Fastest way to do horizontal SSE vector sum (or other reduction). The two 128b shuffles I used don't even need an immediate control byte, so it saves 2 bytes of code size vs. the more obvious shufps. (And 4 bytes of code-size vs. vpermilps, because that opcode always needs a 3-byte VEX prefix as well as an immediate.) AVX 3-operand stuff is very nice compared to SSE, especially when writing in C with intrinsics, where you can't as easily pick a cold register to movhlps into.
I'm going to program a Game Boy emulator (the Z80 is the CPU, in case somebody is not familiar with it), and while I was doing my research, I've found some things I'm not so sure about.
The first one was that C is the programming language to choose here. That's not so much of a problem, but I'd like to hear your opinion from today's point of view. Even C++ was not recommended.
The second thing I found out was that everybody was using one function per opcode. That seems logical, since it's just one function call and probably better optimised than having one function for the "ADD" instruction which then has to find out what registers are used. But how necessary is that today? Is it something I should stick to, or should I rather rewrite my emulator if I notice that another, more convenient way just doesn't cut it (more or less modern gaming consoles pop into my mind right now)?
Also, it's kind of demotivating to write a function for "add that register to this register" over and over again. Is there a way to automate that from an opcode map or something like that?
I mostly agree with WingsOfIcarus. I wrote a few emulators already so here is my insight:
The use of function pointers is a good idea (for speed and clarity of code)
OOP is not a problem
Yes, member calls are a little bit slower, but if you are careful it will not affect performance too much. On the other hand, OOP emulation code is much better to manage/read/understand.
Use an instruction database instead of fixed instruction decoding.
I am using a single text file which consists of all the necessary information for all instructions. The emulator parses it during initialization (it feeds the arrays of function pointers and operands...). In this architecture it is very easy to correct errors in the instruction set without any code change.
Documentation for complex instruction sets is almost always faulty to some degree. The worst case is the Z80 (I have never seen a 100% error-free instruction set). So use more than one instruction-set reference, compare them, and create an error-free set (if you can).
Add sound, video, keyboard and mouse to your emulation
This is usually not a problem. On Windows use WaveOut instead of DirectSound: it's more stable and much faster (usable latencies of DSound are sometimes even > 400 ms). With WaveOut I was able to lower the latency to 20-80 ms, which is OK.
Limit the emulation speed to the emulated CPU's T cycles per second
I am using machine-cycle-correct timing, which is much slower but allows me to correctly implement emulation of hardware peripherals (FDC, DMAC, sound chips, ...) without any hacks.
Support load/save of files for the emulated platform
For example, this is part of my instruction set (which is fed directly to the CPU emulation):
opc T0 T1 MC1 MC2 MC3 MC4 MC5 MC6 MC7 mnemonic
B8 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,B
B9 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,C
BA 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,D
BB 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,E
BC 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,H
BD 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,L
BE 07 00 M1R 4 MRD 3 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,(HL)
BF 04 00 M1R 4 ... 0 ... 0 ... 0 ... 0 ... 0 ... 0 CP A,A
C0 11 05 M1R 5 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 RET NZ
C1 10 00 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 POP BC
C2L2H2 10 10 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 JP NZ,U16
C3L1H1 10 00 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 JP U16
C4L2H2 17 10 M1R 4 MRD 3 MRD 4 MWR 3 MWR 3 ... 0 ... 0 CALL NZ,U16
C5 11 00 M1R 5 MWR 3 MWR 3 ... 0 ... 0 ... 0 ... 0 PUSH BC
C6U2 07 00 M1R 4 MRD 3 ... 0 ... 0 ... 0 ... 0 ... 0 ADD A,U8
C7 11 00 M1R 5 MWR 3 MWR 3 ... 0 ... 0 ... 0 ... 0 RST 00H
C8 11 05 M1R 5 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 RET Z
C9 10 00 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 RET
CAL2H2 10 10 M1R 4 MRD 3 MRD 3 ... 0 ... 0 ... 0 ... 0 JP Z,U16
opc: operation code [hex]
L1,H1,U1,S1 means first operand direct number or address
L2,H2,U2,S2 means second operand direct number or address
L3,H3,U3,S3 means third operand direct number or address
H,L ... U16 high and low byte
U ... U8 unsigned byte
S ... S8 signed byte
T0 normal instruction duration [T] always 2 decimal digits
T1 instruction duration if condition not met [T] always 2 decimal digits
MC1++ Machine cycle first is type,second is duration [T] always 1 decimal digit
... unused
M1R M1 cycle
MRD memory read
MWR memory write
IOR IO read
IOW IO write
NON no external operation (internal computation)
INT interrupt cycle
mnem instruction text (mnemonic)
opc is used for the address in an array of pointers
mnemonic is used to select the proper function pointer, and operands type
T0 and T1 are used for instruction timing (this is enough for rough emulation)
MC1++ are used for correct machine-cycle timings (to implement correct hardware emulation and contention timing)
Here is a download link for my complete Zilog Z80A instruction set with machine cycle timing. Feel free to use it (just mention my nick somewhere). After porting to this I was finally able to pass the ZEXALL test 100%. For more info see Writing a graphical Z80 emulator in C or C++.
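To make the function-pointer dispatch concrete, here is a minimal, hypothetical C sketch (not my emulator's actual code) of the structure such a parsed instruction database feeds: an opcode-indexed table of handlers plus their T-cycle timings. Names and fields are just for illustration, and prefixed opcodes are ignored here:
#include <stdint.h>
typedef struct cpu cpu_t;              /* register file, memory hooks, ...         */
typedef void (*op_fn)(cpu_t *cpu);     /* one handler per decoded operation        */
typedef struct {
    op_fn   exec;                      /* chosen from the mnemonic column          */
    uint8_t t0, t1;                    /* duration in T cycles (taken / not taken) */
} op_entry;
static op_entry optable[256];          /* indexed directly by the opcode byte      */
/* One emulation step: fetch, dispatch through the table, account for timing. */
static void cpu_step(cpu_t *cpu, const uint8_t *mem, uint16_t *pc, long *tstates)
{
    const op_entry *op = &optable[mem[(*pc)++]];
    op->exec(cpu);                     /* operands are fetched inside the handler  */
    *tstates += op->t0;                /* rough timing; use t1 for untaken branches */
}
The initialization code would fill optable from the parsed database, mapping each mnemonic to its handler, as described above.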
First suggestion: you shouldn't use nested switch statements; rather, use an array of function pointers. It's a lot faster (better emulation) and gives nicer code, and nested switches can also get a bit messy. Here are some links where you can read more about these arrays: http://www.newty.de/fpt/fpt.html
http://www.multigesture.net/wp-content/uploads/mirror/zenogais/FunctionPointers.htm
Second suggestion: yes, you can do it in C#, Java or C++, but since you want every single bit of your CPU cycles so you can get as close an emulation as possible - emulating one CPU cycle of the target architecture with the least number of CPU cycles on the current architecture - OOP isn't so good in this case, from what I've heard/read from people. One reason is performance, and the second is pretty much obvious: emulation is, as you probably noticed, a really complex task, and wrapping it in OOP can be an unnecessary pain in the neck.
Here's a pretty cool implementation of working with some opcodes for an NES emulator:
http://bisqwit.iki.fi/jutut/kuvat/programming_examples/nesemu1/
Here are the accompanying YouTube videos, which have a little more explanation of what's going on:
http://www.youtube.com/watch?v=y71lli8MS8s
It uses C++ templates and some additional C++11 features. Whether you choose C++ or C is up to you, but it shouldn't really matter a whole lot. If you're just emulating a Game Boy, I doubt that speed is going to be an issue on modern processors, so try to just use whatever you're comfortable with.