This is related to, but not the same as, this question: Performance optimisations of x86-64 assembly - Alignment and branch prediction. It is also slightly related to my previous question: Unsigned 64-bit to double conversion: why this algorithm from g++
The following is not a real-world test case: this primality testing algorithm is not sensible. I suspect no real-world algorithm would execute such a small inner loop quite so many times (num is a prime of size about 2**50). In C++11:
#include <cmath>

using nt = unsigned long long;

bool is_prime_float(nt num)
{
    for (nt n = 2; n <= sqrt(num); ++n) {   // gcc hoists sqrt(num) into XMM6; n is converted to double for the compare each iteration
        if ( (num % n) == 0 ) { return false; }
    }
    return true;
}
Then g++ -std=c++11 -O3 -S produces the following, with RCX containing n and XMM6 containing sqrt(num). See my previous post for the remaining code (which is never executed in this example, as RCX never becomes large enough to be treated as a signed negative).
jmp .L20
.p2align 4,,10
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .L36 // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .L30 // Failed divisibility test
addq $1, %rcx
jns .L37
// Further code to deal with case when ucomisd can't be used
I time this using std::chrono::steady_clock. I kept getting weird performance changes from just adding or deleting other code, and eventually tracked this down to an alignment issue. The directive .p2align 4,,10 tries to align to a 2**4 = 16 byte boundary, but uses at most 10 bytes of padding to do so, I guess to balance alignment against code size.
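(The timing harness itself is essentially the following minimal sketch; each configuration is run 20 times and the fastest 15 results are kept.)

#include <chrono>
#include <cstdio>

using nt = unsigned long long;
bool is_prime_float(nt num);   // the function shown above

int main()
{
    const nt num = 1000000000000037ULL;   // a prime of size about 2**50 (the value used in the answer below)
    auto t0 = std::chrono::steady_clock::now();
    bool prime = is_prime_float(num);
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%d  %.3f s\n", prime, std::chrono::duration<double>(t1 - t0).count());
}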
I wrote a Python script to replace .p2align 4,,10 with a manually controlled number of nop instructions. The following scatter plot shows the fastest 15 of 20 runs, with time in seconds on the y-axis and the number of bytes of padding on the x-axis:
From objdump, with no padding the pxor instruction occurs at offset 0x402f5f. Running on a laptop with an i5-3210M (IvyBridge) and turbo boost disabled, I found that:
For 0 bytes of padding: slow performance (0.42 s).
For 1-4 bytes of padding (offset 0x402f60 to 0x402f63): slightly better performance (0.41 s, visible on the plot).
For 5-20 bytes of padding (offset 0x402f64 to 0x402f73): fast performance (0.37 s).
For 21-32 bytes of padding (offset 0x402f74 to 0x402f7f): slow performance (0.42 s).
The pattern then repeats, cycling with a 32-byte period.
So a 16-byte alignment doesn't give the best performance; it puts us in the slightly better (or at least lower-variation, judging from the scatter plot) region. Alignment to 32 bytes plus 4 to 19 bytes of offset gives the best performance.
Why am I seeing this performance difference? Why does it seem to violate the rule of aligning branch targets to a 16-byte boundary (see e.g. the Intel optimisation manual)?
I don't see any branch-prediction problems. Could this be a uop cache quirk??
By changing the C++ algorithm to cache sqrt(num) in a 64-bit integer and make the loop purely integer based, I remove the problem: alignment now makes no difference at all.
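For reference, the integer-only variant looks something like this (a sketch of the idea; the exact code isn't shown here):

#include <cmath>

using nt = unsigned long long;

// Integer-only variant: hoist sqrt(num) into an integer bound once, so the hot
// loop contains no cvtsi2sd/ucomisd at all.
bool is_prime_int(nt num)
{
    const nt limit = static_cast<nt>(std::sqrt(static_cast<double>(num)));
    for (nt n = 2; n <= limit; ++n) {
        if ((num % n) == 0) { return false; }
    }
    return true;
}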
Here's what I found on Skylake for the same loop. All the code to reproduce my tests on your hardware is on github.
I observe three different performance levels based on alignment, whereas the OP only really saw 2 primary ones. The levels are very distinct and repeatable2:
We see three distinct performance levels here (the pattern repeats starting from offset 32), which we'll call regions 1, 2 and 3, from left to right (region 2 is split into two parts straddling region 3). The fastest region (1) is from offset 0 to 8, the middle (2) region is from 9-18 and 28-31, and the slowest (3) is from 19-27. The difference between each region is close to or exactly 1 cycle/iteration.
Based on the performance counters, the fastest region is very different from the other two:
All the instructions are delivered from the legacy decoder, not from the DSB1.
There are exactly 2 decoder <-> microcode switches (idq_ms_switches) for every iteration of the loop.
On the other hand, the two slower regions are fairly similar:
All the instructions are delivered from the DSB (uop cache), and not from the legacy decoder.
There are exactly 3 decoder <-> microcode switches per iteration of the loop.
The transition from the fastest to the middle region, as the offset changes from 8 to 9, corresponds exactly to when the loop starts fitting in the uop buffer, because of alignment issues. You count this out in exactly the same way as Peter did in his answer:
Offset 8:
00000000004000a8 <_start.L37>:
ab 1 4000a8: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ac: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b1: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b5: 72 21 jb 4000d8 <_start.L36>
ab 2 4000b7: 31 d2 xor edx,edx
ab 2 4000b9: 48 89 d8 mov rax,rbx
ab 3 4000bc: 48 f7 f1 div rcx
!!!! 4000bf: 48 85 d2 test rdx,rdx
4000c2: 74 0d je 4000d1 <_start.L30>
4000c4: 48 83 c1 01 add rcx,0x1
4000c8: 79 de jns 4000a8 <_start.L37>
In the first column I've annotated how the uops for each instruction end up in the uop cache. "ab 1" means they go in the set associated with an address like ...???a? or ...???b? (each set covers 32 bytes, aka 0x20), while the 1 means way 1 (out of a max of 3).
At the point marked !!!! this busts out of the uop cache, because the test instruction has nowhere to go: all 3 ways are already used up.
Let's look at offset 9 on the other hand:
00000000004000a9 <_start.L37>:
ab 1 4000a9: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000ad: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000b2: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000b6: 72 21 jb 4000d9 <_start.L36>
ab 2 4000b8: 31 d2 xor edx,edx
ab 2 4000ba: 48 89 d8 mov rax,rbx
ab 3 4000bd: 48 f7 f1 div rcx
cd 1 4000c0: 48 85 d2 test rdx,rdx
cd 1 4000c3: 74 0d je 4000d2 <_start.L30>
cd 1 4000c5: 48 83 c1 01 add rcx,0x1
cd 1 4000c9: 79 de jns 4000a9 <_start.L37>
Now there is no problem! The test instruction has slipped into the next 32B line (the cd line), so everything fits in the uop cache.
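To make that counting mechanical, here is a toy model (my own sketch, not anything measured from the hardware) of the packing rules used in the annotations above: each aligned 32-byte window maps to at most 3 ways of up to 6 uops each, an instruction's uops can't be split across ways, and a microcoded instruction like div needs a way to itself. It ignores the "two branches per way" rule and macro-fusion.

#include <vector>

struct Insn { unsigned addr; unsigned uops; bool microcoded; };

// Returns false if any 32B window would need a 4th way, i.e. the loop busts the DSB.
bool fits_in_uop_cache(const std::vector<Insn>& loop)
{
    unsigned window = ~0u, ways = 0, uops_in_way = 0;
    for (const Insn& i : loop) {
        unsigned w = i.addr & ~31u;                  // 32B window this insn starts in
        if (w != window) { window = w; ways = 1; uops_in_way = 0; }
        if (i.microcoded) {
            if (uops_in_way) ++ways;                 // can't share a partially filled way
            uops_in_way = 6;                         // the MSROM insn consumes that whole way
        } else if (uops_in_way + i.uops > 6) {
            ++ways; uops_in_way = i.uops;            // current way is full: start a new one
        } else {
            uops_in_way += i.uops;
        }
        if (ways > 3) return false;
    }
    return true;
}

Feeding it the offset-8 addresses and uop counts (with div marked microcoded) fails exactly at the test instruction, while the offset-9 layout fits, matching the annotations above.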
So that explains why stuff changes between the MITE and DSB at that point. It doesn't, however, explain why the MITE path is faster. I tried some simpler tests with div in a loop, and you can reproduce this with simpler loops without any of the floating point stuff. It's weird and sensitive to random other stuff you put in the loop.
For example this loop also executes faster out of the legacy decoder than the DSB:
ALIGN 32
<add some nops here to switch between DSB and MITE>
.top:
add r8, r9
xor eax, eax
div rbx
xor edx, edx
times 5 add eax, eax
dec rcx
jnz .top
In that loop, adding the pointless add r8, r9 instruction, which doesn't really interact with the rest of the loop, sped things up for the MITE version (but not the DSB version).
So I think the difference between region 1 and regions 2 and 3 is due to the former executing out of the legacy decoder (which, oddly, makes it faster).
Let's also take a look at the offset 18 to offset 19 transition (where region 2 ends and region 3 starts):
Offset 18:
00000000004000b2 <_start.L37>:
ab 1 4000b2: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b6: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bb: 66 0f 2e f0 ucomisd xmm6,xmm0
ab 1 4000bf: 72 21 jb 4000e2 <_start.L36>
cd 1 4000c1: 31 d2 xor edx,edx
cd 1 4000c3: 48 89 d8 mov rax,rbx
cd 2 4000c6: 48 f7 f1 div rcx
cd 3 4000c9: 48 85 d2 test rdx,rdx
cd 3 4000cc: 74 0d je 4000db <_start.L30>
cd 3 4000ce: 48 83 c1 01 add rcx,0x1
cd 3 4000d2: 79 de jns 4000b2 <_start.L37>
Offset 19:
00000000004000b3 <_start.L37>:
ab 1 4000b3: 66 0f ef c0 pxor xmm0,xmm0
ab 1 4000b7: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx
ab 1 4000bc: 66 0f 2e f0 ucomisd xmm6,xmm0
cd 1 4000c0: 72 21 jb 4000e3 <_start.L36>
cd 1 4000c2: 31 d2 xor edx,edx
cd 1 4000c4: 48 89 d8 mov rax,rbx
cd 2 4000c7: 48 f7 f1 div rcx
cd 3 4000ca: 48 85 d2 test rdx,rdx
cd 3 4000cd: 74 0d je 4000dc <_start.L30>
cd 3 4000cf: 48 83 c1 01 add rcx,0x1
cd 3 4000d3: 79 de jns 4000b3 <_start.L37>
The only difference I see here is that the first 4 instructions in the offset 18 case fit into the ab cache line, but only 3 in the offset 19 case. If we hypothesize that the DSB can only deliver uops to the IDQ from one cache set, this means that at some point one uop may be issued and executed a cycle earlier in the offset 18 scenario than in the 19 scenario (imagine, for example, that the IDQ is empty). Depending on exactly what port that uop goes to in the context of the surrounding uop flow, that may delay the loop by one cycle. Indeed, the difference between region 2 and 3 is ~1 cycle (within the margin of error).
So I think we can say that the difference between 2 and 3 is likely due to uop cache alignment - region 2 has a slightly better alignment than 3, in terms of issuing one additional uop one cycle earlier.
Some additional notes on things I checked that didn't pan out as a possible cause of the slowdowns:
Despite the DSB modes (regions 2 and 3) having 3 microcode switches versus the 2 of the MITE path (region 1), that doesn't seem to directly cause the slowdown. In particular, simpler loops with div execute in identical cycle counts, but still show 3 and 2 switches for DSB and MITE paths respectively. So that's normal and doesn't directly imply the slowdown.
Both paths execute an essentially identical number of uops and, in particular, an identical number of uops generated by the microcode sequencer. So it's not that more overall work is being done in the different regions.
There wasn't really any difference in cache misses (very low, as expected) at the various levels, branch mispredictions (essentially zero3), or any other type of penalty or unusual condition I checked.
What did bear fruit is looking at the pattern of execution unit usage across the various regions. Here's a look at the distribution of uops executed per cycle and some stall metrics:
+----------------------------+----------+----------+----------+
| | Region 1 | Region 2 | Region 3 |
+----------------------------+----------+----------+----------+
| cycles: | 7.7e8 | 8.0e8 | 8.3e8 |
| uops_executed_stall_cycles | 18% | 24% | 23% |
| exe_activity_1_ports_util | 31% | 22% | 27% |
| exe_activity_2_ports_util | 29% | 31% | 28% |
| exe_activity_3_ports_util | 12% | 19% | 19% |
| exe_activity_4_ports_util | 10% | 4% | 3% |
+----------------------------+----------+----------+----------+
I sampled a few different offset values and the results were consistent within each region, yet between the regions you get quite different results. In particular, in region 1 you have fewer stall cycles (cycles where no uop is executed). You also have significant variation in the non-stall cycles, although no clear "better" or "worse" trend is evident. For example, region 1 has many more cycles (10% vs 3% or 4%) with 4 uops executed, but the other regions largely make up for it with more cycles with 3 uops executed, and fewer cycles with 1 uop executed.
The difference in UPC4 that the execution distribution above implies fully explains the difference in performance (this is probably a tautology since we already confirmed the uop count is the same between them).
Let's see what toplev.py has to say about it ... (results omitted).
Well, toplev suggests that the primary bottleneck is the front-end (50+%). I don't think you can trust this because the way it calculates FE-bound seems broken in the case of long strings of micro-coded instructions. FE-bound is based on frontend_retired.latency_ge_8, which is defined as:
Retired instructions that are fetched after an interval where the
front-end delivered no uops for a period of 8 cycles which was not
interrupted by a back-end stall. (Supports PEBS)
Normally that makes sense. You are counting instructions which were delayed because the frontend wasn't delivering uops. The "not interrupted by a back-end stall" condition ensures that this doesn't trigger when the front-end isn't delivering uops simply because the backend is not able to accept them (e.g., when the RS is full because the backend is performing some low-throughput instructions).
It kind of seems broken for div instructions, though: even a simple loop with pretty much just one div shows:
FE Frontend_Bound: 57.59 % [100.00%]
BAD Bad_Speculation: 0.01 %below [100.00%]
BE Backend_Bound: 0.11 %below [100.00%]
RET Retiring: 42.28 %below [100.00%]
That is, the only bottleneck is the front-end ("retiring" is not a bottleneck, it represents the useful work). Clearly, such a loop is trivially handled by the front-end and is instead limited by the backend's ability to chew through all the uops generated by the div operation. Toplev might get this really wrong because (1) it may be that the uops delivered by the microcode sequencer aren't counted in the frontend_retired.latency... counters, so that every div operation causes that event to count all the subsequent instructions (even though the CPU was busy during that period - there was no real stall), or (2) the microcode sequencer might deliver all its uops essentially "up front", slamming ~36 uops into the IDQ, at which point it doesn't deliver any more until the div is finished, or something like that.
Still, we can look at the lower levels of toplev for hints:
The main difference toplev calls out between region 1 and regions 2 and 3 is the increased penalty of ms_switches for the latter two regions (since they incur 3 per iteration vs 2 for the legacy path). Internally, toplev estimates a 2-cycle penalty in the frontend for such switches. Of course, whether these penalties actually slow anything down depends in a complex way on the instruction queue and other factors. As mentioned above, a simple loop with div doesn't show any difference between the DSB and MITE paths, but a loop with additional instructions does. So it could be that the extra switch bubble is absorbed in simpler loops (where the backend processing of all the uops generated by the div is the main factor), but once you add some other work to the loop, the switches become a factor, at least for the transition period between the div and non-div work.
So I guess my conclusion is that the way the div instruction interacts with the rest of the frontend uop flow, and with backend execution, isn't completely well understood. We know it involves a flood of uops, delivered both from the MITE/DSB (seems like 4 uops per div) and from the microcode sequencer (seems like ~32 uops per div, although it changes with different input values to the div op) - but we don't know what those uops are (we can see their port distribution though). All that makes the behavior fairly opaque, but I think it probably comes down to either the MS switches bottlenecking the front-end, or slight differences in the uop delivery flow resulting in different scheduling decisions which end up making the MITE ordering faster.
1 Of course, most of the uops are not delivered from the legacy decoder or DSB at all, but by the microcode sequencer (ms). So we loosely talk about instructions delivered, not uops.
2 Note that the x axis here is "offset bytes from 32B alignment". That is, 0 means the top of the loop (label .L37) is aligned to a 32B boundary, and 5 means the loop starts five bytes past a 32B boundary (using nops for padding), and so on. So my padding bytes and offset are the same. The OP used a different meaning for offset, if I understand it correctly: his 1 byte of padding resulted in a 0 offset. So you would subtract 1 from the OP's padding values to get my offset values.
3 In fact, the branch prediction rate for a typical test with prime=1000000000000037 was ~99.999997%, reflecting only 3 mispredicted branches in the entire run (likely on the first pass through the loop, and the last iteration).
4 UPC, i.e., uops per cycle - a measure closely related to IPC for similar programs, and one that is a bit more precise when we are looking in detail at uop flows. In this case, we already know the uop counts are the same for all variations of alignment, so UPC and IPC will be directly proportional.
I don't have a specific answer, just a few different hypotheses that I'm unable to test (lack of hardware). I thought I'd found something conclusive, but I had the alignment off by one (because the question counts padding from 0x5F, not from an aligned boundary). Hopefully it's still useful to post this to describe the factors that are probably at play here.
The question also doesn't specify the encoding of the branches (short (2B) or near (6B)). This leaves too many possibilities to look at and theorize about exactly which instruction crossing a 32B boundary or not is causing the issue.
I think it's either a matter of the loop fitting in the uop cache or not, or else it's a matter of alignment mattering for whether it decodes fast with the legacy decoders.
Obviously that asm loop could be improved a lot (e.g. by hoisting the floating-point out of it, not to mention using a different algorithm entirely), but that's not the question. We just want to know why alignment matters for this exact loop.
You might expect that a loop that bottlenecks on division wouldn't bottleneck on the front-end or be affected by alignment, because division is slow and the loop runs very few instructions per clock. That's true, but 64-bit DIV is micro-coded as 35-57 micro-ops (uops) on IvyBridge, so it turns out there can be front-end issues.
The two main ways alignment can matter are:
Front-end bottlenecks (in the fetch/decode stages), leading to bubbles in keeping the out-of-order core supplied with work to do.
Branch prediction: if two branches have the same address modulo some large power of 2, they can alias each other in the branch prediction hardware. Code alignment in one object file is affecting the performance of a function in another object file scratches the surface of this issue, and much more has been written about it.
I suspect this is a purely front-end issue, not branch prediction, since the code spends all its time in this loop, and isn't running other branches that might alias with the ones here.
Your Intel IvyBridge CPU is a die-shrink of SandyBridge. It has a few changes (like mov-elimination, and ERMSB), but the front-end is similar between SnB/IvB/Haswell. Agner Fog's microarch pdf has enough details to analyze what should happen when the CPU runs this code. See also David Kanter's SandyBridge writeup for a block diagram of the fetch/decode stages, but he splits the fetch/decode from the uop cache, microcode, and decoded-uop queue. At the end, there's a full block diagram of a whole core. His Haswell article has a block diagram including the whole front-end, up to the decoded-uop queue that feeds the issue stage. (IvyBridge, like Haswell, has a 56 uop queue / loopback buffer when not using Hyperthreading. Sandybridge statically partitions them into 2x28 uop queues even when HT is disabled.)
Image copied from David Kanter's also-excellent Haswell write-up, where he includes the decoders and uop-cache in one diagram.
Let's look at how the uop cache will probably cache this loop, once things settle down. (i.e. assuming that the loop entry with a jmp to the middle of the loop doesn't have any serious long-term effect on how the loop sits in the uop cache).
According to Intel's optimization manual (2.3.2.2 Decoded ICache):
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have
their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block).
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way. (i.e. any instruction that takes more than 4 uops (for the reg,reg form) is microcoded. For example, DPPD is not micro-coded (4 uops), but DPPS is (6 uops). DPPD with a memory operand that can't micro-fuse would be 5 total uops, but still wouldn't need to turn on the microcode sequencer (not tested).)
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
David Kanter's SnB writeup has some more great details about the uop cache.
Let's see how the actual code will go into the uop cache:
# let's consider the case where this is 32B-aligned, so it runs in 0.41s
# i.e. this is at 0x402f60, instead of 0 like this objdump -Mintel -d output on a .o
# branch displacements are all 00, and I forgot to put in dummy labels, so they're using the rel32 encoding not rel8.
0000000000000000 <.text>:
0: 66 0f ef c0 pxor xmm0,xmm0 # 1 uop
4: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx # 2 uops
9: 66 0f 2e f0 ucomisd xmm6,xmm0 # 2 uops
d: 0f 82 00 00 00 00 jb 0x13 # 1 uop (end of one uop cache line of 6 uops)
13: 31 d2 xor edx,edx # 1 uop
15: 48 89 d8 mov rax,rbx # 1 uop (end of a uop cache line: next insn doesn't fit)
18: 48 f7 f1 div rcx # microcoded: fills a whole uop cache line. (And generates 35-57 uops)
1b: 48 85 d2 test rdx,rdx ### PROBLEM!! only 3 uop cache lines can map to the same 32-byte block of x86 instructions.
# So the whole block has to be re-decoded by the legacy decoders every time, because it doesn't fit in the uop-cache
1e: 0f 84 00 00 00 00 je 0x24 ## spans a 32B boundary, so I think it goes with TEST in the line that includes the first byte. Should actually macro-fuse.
24: 48 83 c1 01 add rcx,0x1 # 1 uop
28: 79 d6 jns 0x0 # 1 uop
So with 32B alignment for the start of the loop, it has to run from the legacy decoders, which is potentially slower than running from the uop cache. There could even be some overhead in switching from uop cache to legacy decoders.
@IwillnotexistIdonotexist's testing (see comments on the question) reveals that any microcoded instruction prevents a loop from running from the loopback buffer. (LSD = Loop Stream Detector = loop buffer; physically the same structure as the IDQ (instruction decode queue). DSB = Decode Stream Buffer = the uop cache. MITE = legacy decoders.)
Busting the uop cache will hurt performance even if the loop is small enough to run from the LSD (28 uops minimum, or 56 without hyperthreading on IvB and Haswell).
Intel's optimization manual (section 2.3.2.4) says the LSD requirements include
All micro-ops are also resident in the Decoded ICache.
So this explains why microcode doesn't qualify: in that case the uop-cache only holds a pointer into the microcode, not the uops themselves. Also note that this means that busting the uop cache for any other reason (e.g. lots of single-byte NOP instructions) means a loop can't run from the LSD.
Now consider the version with the minimum padding to go fast, according to the OP's testing:
# branch displacements are still 32-bit, except the loop branch.
# This may not be accurate, since the question didn't give raw instruction dumps.
# the version with short jumps looks even more unlikely
0000000000000000 <loop_start-0x64>:
...
5c: 00 00 add BYTE PTR [rax],al
5e: 90 nop
5f: 90 nop
60: 90 nop # 4NOPs of padding is just enough to bust the uop cache before (instead of after) div, if they have to go in the uop cache.
# But that makes little sense, because looking backward should be impossible (insn start ambiguity), and we jump into the loop so the NOPs don't even run once.
61: 90 nop
62: 90 nop
63: 90 nop
0000000000000064 <loop_start>: #uops #decode in cycle A..E
64: 66 0f ef c0 pxor xmm0,xmm0 #1 A
68: f2 48 0f 2a c1 cvtsi2sd xmm0,rcx #2 B
6d: 66 0f 2e f0 ucomisd xmm6,xmm0 #2 C (crosses 16B boundary)
71: 0f 82 db 00 00 00 jb 152 #1 C
77: 31 d2 xor edx,edx #1 C
79: 48 89 d8 mov rax,rbx #1 C
7c: 48 f7 f1 div rcx #line D
# 64B boundary after the REX in next insn
7f: 48 85 d2 test rdx,rdx #1 E
82: 74 06 je 8a <loop_start+0x26>#1 E
84: 48 83 c1 01 add rcx,0x1 #1 E
88: 79 da jns 64 <loop_start>#1 E
The REX prefix of test rdx,rdx is in the same block as the DIV, so this should bust the uop cache. One more byte of padding would put it into the next 32B block, which would make perfect sense. Perhaps the OP's results are wrong, or perhaps prefixes don't count and it's the position of the opcode byte that matters. Or perhaps a macro-fused test+branch is pulled into the next block?
Macro-fusion does still happen across the 64B L1I-cache line boundary here, since the line boundary doesn't fall exactly on the boundary between the two instructions:
Macro fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line. -- Intel's optimization manual, 2.3.2.1
Or maybe with a short encoding for one jump or the other, things are different?
Or maybe busting the uop cache has nothing to do with it, and that's fine as long as it decodes fast, which this alignment makes happen. This amount of padding just barely puts the end of UCOMISD into a new 16B block, so maybe that actually improves efficiency by letting it decode with the other instructions in the next aligned 16B block. However, I'm not sure that a 16B pre-decode (instruction-length finding) or 32B decode block have to be aligned.
I also wondered if the CPU ends up switching from uop cache to legacy decode frequently. That can be worse than running from legacy decode all the time.
Switching from the decoders to the uop cache or vice versa takes a cycle, according to Agner Fog's microarch guide. Intel says:
When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.
The source that I assembled + disassembled:
.skip 0x5e
nop
# this is 0x5F
#nop # OP needed 1B of padding to reach a 32B boundary
.skip 5, 0x90
.globl loop_start
loop_start:
.L37:
pxor %xmm0, %xmm0
cvtsi2sdq %rcx, %xmm0
ucomisd %xmm0, %xmm6
jb .Loop_exit // Exit the loop
.L20:
xorl %edx, %edx
movq %rbx, %rax
divq %rcx
testq %rdx, %rdx
je .Lnot_prime // Failed divisibility test
addq $1, %rcx
jns .L37
.skip 200 # comment this to make the jumps rel8 instead of rel32
.Lnot_prime:
.Loop_exit:
From what I can see in your algorithm, there is certainly not much you can do to improve it.
The problem you are hitting is probably not so much the branch to an aligned position, although that can still help; your current problem is much more likely the pipeline mechanism.
When you write two instructions one after another such as:
mov %eax, %ebx
add $1, %ebx
In order to execute the second instruction, the first one has to be complete. For that reason compilers tend to mix instructions. Say you need to set %ecx to zero, you could do this:
mov %eax, %ebx
xor %ecx, %ecx
add $1, %ebx
In this case, the mov and the xor can both be executed in parallel. This makes things go faster... The number of instructions that can be handled in parallel varies a great deal between processors (Xeons are generally better at that).
Branching adds another variable: the best processors may start executing both sides of the branch (the true and the false paths) simultaneously, but most processors will simply make a guess and hope they are right.
Finally, it is obvious that converting the sqrt() result to an integer will make things a lot faster, since you avoid all that nonsense with SSE2 code, which is definitely slower when used only for a conversion + compare that could be done with integer instructions.
Now... you are probably still wondering why the alignment does not matter with the integers. The fact is that if your code fits in the L1 instruction cache, then the alignment is not important. If you lose the L1 cache, then it has to reload the code and that's where the alignment becomes quite important since on each loop it could otherwise be loading useless code (possibly 15 bytes of useless code...) and memory access is still dead slow.
The performance difference can be explained by the different ways the instruction decoding mechanism "sees" the instructions. A CPU reads instructions in chunks (16 bytes on Core2, I believe) and tries to feed micro-ops to the different superscalar units. If the instructions sit on chunk boundaries or are ordered unfavourably, the units in one core can starve quite easily.
Related
I want to be able to predict, by hand, exactly how long arbitrary arithmetical (i.e. no branching or memory, though that would be nice too) x86-64 assembly code will take given a particular architecture, taking into account instruction reordering, superscalarity, latencies, CPIs, etc.
What are the rules that must be followed to achieve this?
I think I've got some preliminary rules figured out, but I haven't been able to find any references on breaking down any example code to this level of detail, so I've had to take some guesses. (For example, the Intel optimization manual barely even mentions instruction reordering.)
At minimum, I'm looking for (1) confirmation that each rule is correct or else a correct statement of each rule, and (2) a list of any rules that I may have forgotten.
As many instructions as possible are issued each cycle, starting in-order from the current cycle and potentially as far ahead as the reorder buffer size.
An instruction can be issued on a given cycle if:
No instructions that affect its operands are still being executed. And:
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering). And:
There is a functional unit available for that instruction on that cycle. Every (?) functional unit is pipelined, meaning it can accept 1 new instruction per cycle, and the number of total functional units is 1/CPI, for the CPI of a given function class (nebulous here: presumably e.g. addps and subps use the same functional unit? How do I determine this?). And:
Fewer than the superscalar width (typically 4) number of instructions have already been issued this cycle.
If no instructions can be issued, the processor simply doesn't issue any—a condition called a "stall".
As an example, consider the following example code (which computes a cross-product):
shufps xmm3, xmm2, 210
shufps xmm0, xmm1, 201
shufps xmm2, xmm2, 201
mulps xmm0, xmm3
shufps xmm1, xmm1, 210
mulps xmm1, xmm2
subps xmm0, xmm1
My attempt to predict the latency for Haswell looks something like this:
; `mulps` Haswell latency=5, CPI=0.5
; `shufps` Haswell latency=1, CPI=1
; `subps` Haswell latency=3, CPI=1
shufps xmm3, xmm2, 210 ; cycle 1
shufps xmm0, xmm1, 201 ; cycle 2
shufps xmm2, xmm2, 201 ; cycle 3
mulps xmm0, xmm3 ; (superscalar execution)
shufps xmm1, xmm1, 210 ; cycle 4
mulps xmm1, xmm2 ; cycle 5
; cycle 6 (stall `xmm0` and `xmm1`)
; cycle 7 (stall `xmm1`)
; cycle 8 (stall `xmm1`)
subps xmm0, xmm1 ; cycle 9
; cycle 10 (stall `xmm0`)
TL:DR: look for dependency chains, especially loop-carried ones. For a long-running loop, see which latency, front-end throughput, or back-end port contention/throughput is the worst bottleneck. That's how many cycles your loop probably takes per iteration, on average, if there are no cache misses or branch mispredicts.
Latency bounds and throughput bounds for processors for operations that must occur in sequence is a good example of analyzing loop-carried dependency chains in a specific loop with two dep chains, one pulling values from the other.
Related: How many CPU cycles are needed for each assembly instruction? is a good introduction to throughput vs. latency on a per-instruction basis, and what that means for sequences of multiple instructions. See also Assembly - How to score a CPU instruction by latency and throughput for how to measure a single instruction.
This is called static (performance) analysis. Wikipedia says (https://en.wikipedia.org/wiki/List_of_performance_analysis_tools) that AMD's CodeXL has a "static kernel analyzer" (i.e. for computational kernels, aka loops). I've never tried it.
Intel also has a free tool for analyzing how loops will go through the pipeline in Sandybridge-family CPUs: What is IACA and how do I use it?
IACA is not bad, but has bugs (e.g. wrong data for shld on Sandybridge, and last I checked, it doesn't know that Haswell/Skylake can keep indexed addressing modes micro-fused for some instructions. But maybe that will change now that Intel's added details on that to their optimization manual.) IACA is also unhelpful for counting front-end uops to see how close to a bottleneck you are (it likes to only give you unfused-domain uop counts).
Static analysis is often pretty good, but definitely check by profiling with performance counters. See Can x86's MOV really be "free"? Why can't I reproduce this at all? for an example of profiling a simple loop to investigate a microarchitectural feature.
Essential reading:
Agner Fog's microarch guide (chapter 2: Out of order exec) explains some of the basics of dependency chains and out-of-order execution. His "Optimizing Assembly" guide has more good introductory and advanced performance stuff.
The later chapters of his microarch guide cover the details of the pipelines in CPUs like Nehalem, Sandybridge, Haswell, K8/K10, Bulldozer, and Ryzen. (And Atom / Silvermont / Jaguar).
Agner Fog's instruction tables (spreadsheet or PDF) are also normally the best source for instruction latency / throughput / execution-port breakdowns.
David Kanter's microarch analysis docs are very good, with diagrams. e.g. https://www.realworldtech.com/sandy-bridge/, https://www.realworldtech.com/haswell-cpu/, and https://www.realworldtech.com/bulldozer/.
See also other performance links in the x86 tag wiki.
I also took a stab at explaining how a CPU core finds and exploits instruction-level parallelism in this answer, but I think you've already grasped those basics as far as it's relevant for tuning software. I did mention how SMT (Hyperthreading) works as a way to expose more ILP to a single CPU core, though.
In Intel terminology:
"issue" means to send a uop into the out-of-order part of the core; along with register-renaming, this is the last step in the front-end. The issue/rename stage is often the narrowest point in the pipeline, e.g. 4-wide on Intel since Core2. (With later uarches like Haswell and especially Skylake often actually coming very close to that in some real code, thanks to SKL's improved decoders and uop-cache bandwidth, as well as back-end and cache bandwidth improvements.) This is fused-domain uops: micro-fusion lets you send 2 uops through the front-end and only take up one ROB entry. (I was able to construct a loop on Skylake that sustains 7 unfused-domain uops per clock). See also http://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ re: out-of-order window size.
"dispatch" means the scheduler sends a uop to an execution port. This happens as soon as all the inputs are ready, and the relevant execution port is available. How are x86 uops scheduled, exactly?. Scheduling happens in the "unfused" domain; micro-fused uops are tracked separately in the OoO scheduler (aka Reservation Station, RS).
A lot of other computer-architecture literature uses these terms in the opposite sense, but this is the terminology you will find in Intel's optimization manual, and the names of hardware performance counters like uops_issued.any or uops_dispatched_port.port_5.
exactly how long arbitrary arithmetical x86-64 assembly code will take
It depends on the surrounding code as well, because of OoO exec
Your final subps result doesn't have to be ready before the CPU starts running later instructions. Latency only matters for later instructions that need that value as an input, not for integer looping and whatnot.
Sometimes throughput is what matters, and out-of-order exec can hide the latency of multiple independent short dependency chains. (e.g. if you're doing the same thing to every element of a big array of multiple vectors, multiple cross products can be in flight at once.) You'll end up with multiple iterations in flight at once, even though in program order you finish all of one iteration before doing any of the next. (Software pipelining can help for high-latency loop bodies if OoO exec has a hard time doing all the reordering in HW.)
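As a concrete sketch of that latency-vs-throughput point (my example, not anything from the question): summing an array with a single accumulator is one long loop-carried dependency chain, so it runs at one add per FP-add latency, while splitting the sum into independent accumulators gives out-of-order exec several chains to overlap, approaching the add throughput limit instead.

#include <cstddef>

float sum1(const float* a, std::size_t n)
{
    float s = 0.f;
    for (std::size_t i = 0; i < n; ++i) s += a[i];   // one chain: latency-bound
    return s;
}

float sum4(const float* a, std::size_t n)            // assumes n % 4 == 0 for brevity
{
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    for (std::size_t i = 0; i < n; i += 4) {         // four independent chains
        s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}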
There are three major dimensions to analyze for a short block
You can approximately characterize a short block of non-branching code in terms of these three factors. Usually only one of them is the bottleneck for a given use-case. Often you're looking at a block that you will use as part of a loop, not as the whole loop body, but OoO exec normally works well enough that you can just add up these numbers for a couple different blocks, if they're not so long that OoO window size prevents finding all the ILP.
latency from each input to the output(s). Look at which instructions are on the dependency chain from each input to each output. e.g. one choice might need one input to be ready sooner.
total uop count (for front-end throughput bottlenecks), fused-domain on Intel CPUs. e.g. Core2 and later can in theory issue/rename 4 fused-domain uops per clock into the out-of-order scheduler/ROB. Sandybridge-family can often achieve that in practice with the uop cache and loop buffer, especially Skylake with its improved decoders and uop-cache throughput.
uop count for each back-end execution port (unfused domain). e.g. shuffle-heavy code will often bottleneck on port 5 on Intel CPUs. Intel usually only publishes throughput numbers, not port breakdowns, which is why you have to look at Agner Fog's tables (or IACA output) to do anything meaningful if you're not just repeating the same instruction a zillion times.
Generally you can assume best-case scheduling/distribution, with uops that can run on other ports not stealing the busy ports very often, but it does happen sometimes. (How are x86 uops scheduled, exactly?)
Looking at CPI is not sufficient; two CPI=1 instructions might or might not compete for the same execution port. If they don't, they can execute in parallel. e.g. Haswell can only run psadbw on port 0 (5c latency, 1c throughput, i.e. CPI=1) but it's a single uop so a mix of 1 psadbw + 3 add instructions could sustain 4 instructions per clock. There are vector ALUs on 3 different ports in Intel CPUs, with some operations replicated on all 3 (e.g. booleans) and some only on one port (e.g. shifts before Skylake).
Sometimes you can come up with a couple different strategies, one maybe lower latency but costing more uops. A classic example is multiplying by constants like imul eax, ecx, 10 (1 uop, 3c latency on Intel) vs. lea eax, [rcx + rcx*4] / add eax,eax (2 uops, 2c latency). Modern compilers tend to choose 2 LEA vs. 1 IMUL, although clang up to 3.7 favoured IMUL unless it could get the job done with only a single other instruction.
See What is the efficient way to count set bits at a position or lower? for an example of static analysis for a few different ways to implement a function.
See also Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) (which ended up being way more detailed than you'd guess from the question title) for another summary of static analysis, and some neat stuff about unrolling with multiple accumulators for a reduction.
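As a rough worked example, applying those three factors to the cross-product block above (my own back-of-the-envelope numbers, using the Haswell figures quoted in the question): it's 7 single-uop instructions, so the front-end bound is about 7/4 = 1.75 cycles per cross product; all 4 shufps compete for port 5, so the back-end throughput bound is about 4 cycles per cross product; and the latency from the inputs to the subps result is roughly shufps(1) + mulps(5) + subps(3) = 9 cycles, before counting any port-5 queueing among the shuffles. So many independent cross products would bottleneck on port 5 at about 4 cycles each, while a chain of dependent ones would be limited by the 9+ cycle latency.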
Every (?) functional unit is pipelined
The divider is pipelined in recent CPUs, but not fully pipelined. (FP divide is single-uop, though, so if you do one divps mixed in with dozens of mulps / addps, it can have negligible throughput impact if latency doesn't matter: Floating point division vs floating point multiplication. rcpps + a Newton iteration is worse throughput and about the same latency.)
Everything else is fully pipelined on mainstream Intel CPUs; no multi-cycle (reciprocal) throughput for a single-uop instruction. (Variable-count integer shifts like shl eax, cl have lower-than-expected throughput for their 3 uops, because they create a dependency through the flag-merging uops. But if you break that dependency through FLAGS with an add or something, you can get better throughput and latency.)
On AMD before Ryzen, the integer multiplier is also only partially pipelined. e.g. Bulldozer's imul ecx, edx is only 1 uop, but with 4c latency, 2c throughput.
Xeon Phi (KNL) also has some not-fully-pipelined shuffle instructions, but it tends to bottleneck on the front-end (instruction decode), not the back-end, and does have a small buffer + OoO exec capability to hide back-end bubbles.
If it is a floating-point instruction, every floating-point instruction before it has been issued (floating-point instructions have static instruction re-ordering)
No.
Maybe you read that for Silvermont, which doesn't do OoO exec for FP/SIMD, only integer (with a small ~20 uop window). Maybe some ARM chips are like that, too, with simpler schedulers for NEON? I don't know much about ARM uarch details.
The mainstream big-core microarchitectures like P6 / SnB-family, and all AMD OoO chips, do OoO exec for SIMD and FP instructions the same as for integer. AMD CPUs use a separate scheduler, but Intel uses a unified scheduler so its full size can be applied to finding ILP in integer or FP code, whichever is currently running.
Even the silvermont-based Knight's Landing (in Xeon Phi) does OoO exec for SIMD.
x86 is generally not very sensitive to instruction ordering, but uop scheduling doesn't do critical-path analysis. So it could sometimes help to put instructions on the critical path first, so they aren't stuck waiting with their inputs ready while other instructions run on that port, leading to a bigger stall later when we get to instructions that need the result of the critical path. (i.e. that's why it is the critical path.)
My attempt to predict the latency for Haswell looks something like this:
Yup, that looks right. shufps runs on port 5, addps runs on p1, mulps runs on p0 or p1. Skylake drops the dedicated FP-add unit and runs SIMD FP add/mul/FMA on the FMA units on p0/p1, all with 4c latency (up/down from 3/5/5 in Haswell, or 3/3/5 in Broadwell).
This is a good example of why keeping a whole XYZ direction vector in a SIMD vector usually sucks. Keeping an array of X, an array of Y, and an array of Z, would let you do 4 cross products in parallel without any shuffles.
The SSE tag wiki has a link to these slides: SIMD at Insomniac Games (GDC 2015), which cover the array-of-structs vs. struct-of-arrays issue for 3D vectors, and why it's often a mistake to always try to SIMD a single operation instead of using SIMD to do multiple operations in parallel.
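To make that concrete, here is a sketch (my code, not anything from the linked slides) of four cross products at once with separate X/Y/Z arrays; note that it needs only multiplies and subtracts, no shuffles at all.

#include <immintrin.h>

void cross4(const float* ax, const float* ay, const float* az,
            const float* bx, const float* by, const float* bz,
            float* cx, float* cy, float* cz)
{
    __m128 Ax = _mm_loadu_ps(ax), Ay = _mm_loadu_ps(ay), Az = _mm_loadu_ps(az);
    __m128 Bx = _mm_loadu_ps(bx), By = _mm_loadu_ps(by), Bz = _mm_loadu_ps(bz);
    _mm_storeu_ps(cx, _mm_sub_ps(_mm_mul_ps(Ay, Bz), _mm_mul_ps(Az, By)));  // cx = ay*bz - az*by
    _mm_storeu_ps(cy, _mm_sub_ps(_mm_mul_ps(Az, Bx), _mm_mul_ps(Ax, Bz)));  // cy = az*bx - ax*bz
    _mm_storeu_ps(cz, _mm_sub_ps(_mm_mul_ps(Ax, By), _mm_mul_ps(Ay, Bx)));  // cz = ax*by - ay*bx
}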
I'm just getting started with the ARM architecture on my Nucleo STM32F303RE, and I'm trying to understand how the instructions are encoded.
I have running a simple LED-blinking program, and the first few disassembled application instructions are:
08000188: push {lr}
0800018a: sub sp, #12
235 __initialize_hardware_early ();
0800018c: bl 0x80005b8 <__initialize_hardware_early>
These instructions resolve to the following in the hex file (displayed weird in Eclipse -- each 32-bit word is in MSB order, but Eclipse doesn't seem to know it... but that's for another topic):
address 0x08000188: B083B500 FA14F000
Using the ARM Architecture Ref Manual, I've confirmed the first 2 instructions, push (0xB500) and sub (0xB083). But I can't make any sense out of the "bl" instruction.
The hex instruction is 0xFA14F000. The Ref Manual says it breaks down like this:
31..28   27 26 25   24   23......................0
 cond     1  0  1    L        signed_immed_24
The first "F" (0xF......) makes sense: all conditions are set (ALways).
The "A" doesn't make sense though, since the L bit should be set (1011). Shouldn't it be 0xFB......?
And the signed_immed_24 doesn't make sense, either. The ref manual says:
- start with 0x14F000
- sign extend to 30 bits (signed 2's-complement), giving 0x0014F000
- shift left to form 32-bit value, giving 0x0053C000
- add to the PC, which is the current instruction + 8, giving 0x0800018c + 8 + 0x0053C000, or 0x0853C194.
So I get a branch address of 0x0853C194, but the disassembly shows 0x080005B8.
What am I missing?
Thanks!
-Eric
bl is two separate 16-bit instructions. The ARMv5 (and older) ARM ARM does a better job of documenting them.
15 14 13   12 11   10.........0
 1  1  1    H  H    offset_11
From the ARM ARM
The first Thumb instruction has H == 10 and supplies the high part of
the branch offset. This instruction sets up for the subroutine call
and is shared between the BL and BLX forms.
The second Thumb instruction has H == 11 (for BL) or H == 01 (for
BLX). It supplies the low part of the branch offset and causes the
subroutine call to take place.
0xFA14 0xF000
0xF000 is the first instruction; it supplies the upper part of the offset, which is all zeros here.
0xFA14 is the second instruction; it supplies the lower part of the offset, 0x214.
Starting at 0x0800018C, the target is then 0x0800018C + 4 + (0x214 << 1) = 0x080005B8. The 4 is there because the current PC reads as two instructions ahead, and the offset is in units of (16-bit) instructions.
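If it helps, here is a small sketch (mine, not from any manual) that mechanizes that calculation for a Thumb BL pair:

#include <cstdint>

// target = PC + ((offset_hi << 12) | (offset_lo << 1)), sign-extended, where the
// PC reads as the address of the first halfword + 4.
uint32_t thumb_bl_target(uint32_t addr_of_first, uint16_t first, uint16_t second)
{
    uint32_t hi = first  & 0x7FF;               // H == 10 halfword: upper 11 offset bits
    uint32_t lo = second & 0x7FF;               // H == 11 halfword: lower 11 offset bits
    uint32_t off = (hi << 12) | (lo << 1);
    if (off & 0x00400000u) off |= 0xFF800000u;  // sign-extend the 23-bit offset
    return addr_of_first + 4 + off;
}

// thumb_bl_target(0x0800018C, 0xF000, 0xFA14) == 0x080005B8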
I guess the ARMv7-M ARM ARM covers it as well, but it is harder to read, and apparently features were added; they do not affect this particular branch-and-link, though.
The ARMv5 ARM ARM does a better job of describing what happens as well. You can certainly take these two separate instructions and move them apart:
.byte 0x00,0xF0
nop
nop
nop
nop
nop
.byte 0x14,0xFA
and it will branch to the same offset (relative to the second instruction). Maybe they broke that in some cores, but I know it works in some (after ARMv5).
SHLD/SHRD are assembly instructions for implementing multiprecision shifts.
Consider the following problem:
uint64_t array[4] = {/*something*/};
left_shift(array, 172);
right_shift(array, 172);
What is the most efficient way to implement left_shift and right_shift, two functions that perform a shift on an array of four 64-bit unsigned integers as if it were one big 256-bit unsigned integer?
Is the most efficient way of doing that to use SHLD/SHRD instructions, or are there better instructions (like SIMD versions) on modern architectures?
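For concreteness, a plain portable version of the semantics I mean might look like this (just a reference sketch, assuming array[0] holds the least-significant 64 bits):

#include <cstdint>
#include <cstring>

void left_shift(uint64_t a[4], unsigned n)   // n may be >= 64, e.g. 172
{
    uint64_t r[4] = {0, 0, 0, 0};
    const int word = n / 64;                 // whole-word part of the shift
    const int bits = n % 64;                 // remaining bit part
    for (int i = 3; i >= word; --i) {
        r[i] = a[i - word] << bits;
        if (bits != 0 && i - word - 1 >= 0)
            r[i] |= a[i - word - 1] >> (64 - bits);   // carry bits up from the word below
    }
    std::memcpy(a, r, sizeof r);
}

right_shift is the mirror image.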
In this answer I'm only going to talk about x64.
x86 has been outdated for 15 years now; if you're coding in 2016 it hardly makes sense to be stuck in 2000.
All times are according to Agner Fog's instruction tables.
Intel Skylake example timings*
The shld/shrd instructions are rather slow on x64.
Even on Intel Skylake they have a latency of 4 cycles and use 4 uops, meaning they tie up a lot of execution units; on older processors they're even slower.
I'm going to assume you want to shift by a variable amount, which means a
SHLD RAX,RDX,cl 4 uops, 4 cycle latency. -> 1/16 per bit
Using 2 shifts plus masking you can try to do this differently, but it turns out slower rather than faster:
#Init:
MOV R15,-1
SHR R15,cl //mask for later use.
#Work:
SHL RAX,cl 3 uops, 2 cycle latency
ROL RDX,cl 3 uops, 2 cycle latency
AND RDX,R15 1 uop, 0.25c reciprocal throughput
OR RAX,RDX 1 uop, 0.25c reciprocal throughput
//Still needs unrolling to achieve least amount of slowness.
Note that this only shifts 64 bits because RDX is not affected.
So you're trying to beat 4 cycles per 64 bits.
//4*64 bits parallel shift.
//Shifts in zeros.
VPSLLVQ YMM2, YMM2, YMM3 1 uop, 0.5c reciprocal throughput.
However, if you want it to do exactly what SHLD does, you'll need to use an extra VPSRLVQ and a VPOR to combine the two results.
VPSLLVQ YMM1, YMM2, YMM3 1 uop, 0.5c reciprocal throughput.
VPSRLVQ YMM5, YMM2, YMM4 1 uop, 0.5c reciprocal throughput.
VPOR YMM1, YMM1, YMM5 1 uop, 0.33c reciprocal throughput.
You'll need to interleave 4 sets of these costing you (3*4)+2=14 YMM registers.
Doing so, I doubt you'll profit from the low 0.33-cycle throughput of VPOR, so I'll assume 0.5 cycles instead.
That makes 3 uops and 1.5 cycles per 256 bits = 1/171 cycle per bit = 0.37 cycles per QWord, roughly 10x faster; not bad.
If you are able to get 1.33 cycles per 256 bits, that's 1/192 cycle per bit = 0.33 cycles per QWord, or 12x faster.
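In intrinsics, the three-instruction combine above might look something like this (my sketch; shld_epi64 is just a name I made up, and it assumes a per-lane shift count with 0 < n < 64):

#include <immintrin.h>

// result[i] = (hi[i] << n) | (lo[i] >> (64 - n)), i.e. an SHLD-style combine in
// each 64-bit lane, using VPSLLVQ / VPSRLVQ / VPOR as above.
static inline __m256i shld_epi64(__m256i hi, __m256i lo, unsigned n)
{
    __m256i left  = _mm256_sllv_epi64(hi, _mm256_set1_epi64x((long long)n));
    __m256i right = _mm256_srlv_epi64(lo, _mm256_set1_epi64x(64 - (long long)n));
    return _mm256_or_si256(left, right);
}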
'It’s the Memory, Stupid!'
Obviously I've not added in loop overhead and load/stores to/from memory.
The loop overhead is tiny given proper alignment of jump targets, but the memory
access will easily be the biggest slowdown.
A single cache miss to main memory on Skylake can cost you more than 250 cycles1.
It is in clever management of memory that the major gains will be made.
The 12 times possible speed-up using AVX256 is small potatoes in comparison.
I'm not counting the set up of the shift counter in CL/(YMM3/YMM4) because I'm assuming you'll reuse that value over many iterations.
You're not going to beat that with AVX512 instructions, because consumer-grade CPUs with AVX512 are not yet available.
The only processor that currently supports it is Knights Landing.
*) All these timings are best case values, and should be taken as indications, not as hard values.
1) Cost of a cache miss in Skylake: 42 cycles + 52 ns = 42 + (52 * 4.6) = 281 cycles (at 4.6 GHz).
In this compiler output, I'm trying to understand how machine-code encoding of the nopw instruction works:
00000000004004d0 <main>:
4004d0: eb fe jmp 4004d0 <main>
4004d2: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
4004d9: 1f 84 00 00 00 00 00
There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of 4004d2-4004e0? From looking at the opcode list, it seems that 66 .. codes are multi-byte expansions. I feel I could probably get a better answer to this here than I would unless I tried to grok the opcode list for a few hours.
That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;
main() {
recurse();
}
recurse() {
i++;
recurse();
}
When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in the main() without calling the recurse() function.
editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66 bytes are an "Operand-Size Override" prefix. Having more than one of these is equivalent to having one.
The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise - which is why it shows up in the assembly mnemonic).
0x0f 0x1f is a 2 byte opcode for a NOP that takes a ModRM byte
0x84 is ModRM byte which in this case codes for an addressing mode that uses 5 more bytes.
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
AMD K8 decoders in Agner Fog's microarch pdf:
Each of the instruction decoders can handle three prefixes per clock
cycle. This means that three instructions with three prefixes each can
be decoded in the same clock cycle. An instruction with 4 - 6 prefixes
takes an extra clock cycle to decode.
Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is
-falign-functions=16. For NOPs that will be executed, the optimal choice of long-NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
You can find the details on the instruction encoding in AMD's tech ref documents:
http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
Mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).
The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
I would guess this is just the branch-delay instruction.
I believe that the nopw is junk - i is never read in your program, and there is thus no need to increment it.
I realize that this question is impossible to answer absolutely, but I'm only after ballpark figures:
Given a reasonably sized C program (thousands of lines of code), on average, how many ASM instructions would be generated? In other words, what's a realistic C-to-ASM instruction ratio? Feel free to make assumptions, such as 'with current x86 architectures'.
I tried to Google about this, but I couldn't find anything.
Addendum: noticing how much confusion this question brought, I feel some need for an explanation. What I wanted to know is, in practical terms, what "3 GHz" means. I am fully aware that the throughput per hertz varies tremendously depending on the architecture, your hardware, caches, bus speeds, and the position of the moon.
I am not after a precise and scientific answer, but rather an empirical answer that could be put into fathomable scales.
This isn't a trivial figure to pin down (as I came to notice), and this was my best effort at it. I know that the number of resulting lines of ASM per line of C varies depending on what you are doing: i++ is not in the same neighborhood as sqrt(23.1) - I know this. Additionally, no matter what ASM I get out of the C, the ASM is interpreted into various sets of microcode within the processor, which, again, depends on whether you are running AMD, Intel or something else, and their respective generations. I'm aware of this as well.
The ballpark answers I've got so far are what I have been after: a project large enough averages at about 2 lines of x86 ASM per 1 line of ANSI C. Today's processors would probably average about one ASM instruction per clock cycle, once the pipelines are filled and given a big enough sample.
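Putting those ballpark figures together as a rough worked example (and ignoring memory stalls and everything else the answers below warn about): 3e9 cycles per second at roughly 1 instruction per cycle, divided by roughly 2 instructions per line of C, works out to on the order of 1.5e9 C lines' worth of straight-line work per second.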
There is no answer possible. Statements like int a; might require zero asm lines, while statements like a = call_is_inlined(); might require 20+ asm lines.
You can see for yourself by compiling a C program and then running objdump -Sd ./a.out. It will display asm and C code intermixed, so you can see how many asm lines are generated for one C line. Example:
test.c
int get_int(int c);
int main(void) {
int a = 1, b = 2;
return get_int(a) + b;
}
$ gcc -c -g test.c
$ objdump -Sd ./test.o
00000000 <main>:
int get_int(int c);
int main(void) { /* here, the prologue creates the frame for main */
0: 8d 4c 24 04 lea 0x4(%esp),%ecx
4: 83 e4 f0 and $0xfffffff0,%esp
7: ff 71 fc pushl -0x4(%ecx)
a: 55 push %ebp
b: 89 e5 mov %esp,%ebp
d: 51 push %ecx
e: 83 ec 14 sub $0x14,%esp
int a = 1, b = 2; /* setting up space for locals */
11: c7 45 f4 01 00 00 00 movl $0x1,-0xc(%ebp)
18: c7 45 f8 02 00 00 00 movl $0x2,-0x8(%ebp)
return get_int(a) + b;
1f: 8b 45 f4 mov -0xc(%ebp),%eax
22: 89 04 24 mov %eax,(%esp)
25: e8 fc ff ff ff call 26 <main+0x26>
2a: 03 45 f8 add -0x8(%ebp),%eax
} /* the epilogue runs, returning to the previous frame */
2d: 83 c4 14 add $0x14,%esp
30: 59 pop %ecx
31: 5d pop %ebp
32: 8d 61 fc lea -0x4(%ecx),%esp
35: c3 ret
I'm not sure what you mean by "C-instruction"; maybe statement or line? Of course this will vary greatly due to a number of factors, but after looking at a few sample programs of my own, many of them are close to the 2-to-1 mark (2 assembly instructions per line of C). I don't know what this means or how it might be useful.
You can figure this out yourself for any particular program and implementation combination by asking the compiler to generate only the assembly (gcc -S for example) or by using a disassembler on an already compiled executable (but you would need the source code to compare it to anyway).
Edit
Just to expand on this based on your clarification of what you are trying to accomplish (understanding how many lines of code a modern processor can execute in a second):
While a modern processor may run at 3 billion cycles per second that doesn't mean that it can execute 3 billion instructions per second. Here are some things to consider:
Many instructions take multiple cycles to execute (division or floating point operations can take dozens of cycles to execute).
Most programs spend the vast majority of their time waiting for things like memory accesses, disk accesses, etc.
Many other factors including OS overhead (scheduling, system calls, etc.) are also limiting factors.
But in general yes, processors are incredibly fast and can accomplish amazing things in a short period of time.
That varies tremendously! I wouldn't believe anyone if they tried to offer a rough conversion.
Statements like i++; can translate to a single INC AX.
Statements for function calls containing many parameters can be dozens of instructions as the stack is setup for the call.
Then add in there the compiler optimization that will assemble your code in a manner different than you wrote it thus eliminating instructions.
Also some instructions run better on machine word boundaries so NOPs will be peppered throughout your code.
I don't think you can conclude anything useful whatsoever about performance of real applications from what you're trying to do here. Unless 'not precise' means 'within several orders of magnitude'.
You're just way overgeneralised, and you're dismissing caching, etc, as though it's secondary, whereas it may well be totally dominant.
If your application is large enough to have trended to some average instructions-per-loc, then it will also be large enough to have I/O or at the very least significant RAM access issues to factor in.
Depending on your environment you could use the visual studio option : /FAs
more here
I am not sure there is really a useful answer to this. For sure you will have to pick the architecture (as you suggested).
What I would do: Take a reasonable sized C program. Give gcc the "-S" option and check yourself. It will generate the assembler source code and you can calculate the ratio for that program yourself.
RISC or CISC? What's an instruction in C, anyway?
Which is to repeat the above points that you really have no idea until you get very specific about the type of code you're working with.
You might try reviewing the academic literature regarding assembly optimization and the hardware/software interference cross-talk that has happened over the last 30-40 years. That's where you're going to find some kind of real data about what you're interested in. (Although I warn you, you might wind up seeing C->PDP data instead of C->IA-32 data).
You wrote in one of the comments that you want to know what 3GHz means.
Even the frequency of the CPU does not matter. Modern PC CPUs interleave and schedule instructions heavily; they fetch and prefetch, cache memory and instructions, and often that cache is invalidated and thrown in the bin. The best estimate of processing power can be gained by running real-world performance benchmarks.