I recently discovered that the _mm_crc32_* intel intrinsic instruction could be used to generate (pseudo-)random 32 bit numbers.
#include <nmmintrin.h> /* needs CRC32C instruction from SSE4.2 instruction set extension */
uint32_t rnd = 1; /* initialize with seed != 0 */
/* period length is 4,294,967,295 = 2^32-1 */
while (1) {
#if 0 // this was faster but worse than xorshift32 (fails more tests)
// rnd = _mm_crc32_u8(rnd, rnd >> 3);
#else // this is faster and better than xorshift32 (fails fewer tests)
rnd = _mm_crc32_u32(rnd, rnd << 18);
#endif
printf("%08X\n", rnd);
}
This method is as fast as a LCG, and faster than xorshift32. Wikipedia says that because xorshift generators "fail a few statistical tests, they have been accused of being unreliable".
Now I am wondering if the CRC32C method passes the various tests that are done on random number generators. I only verified that each bit, even the LSB, is "random" by trying to compress using PAQ8 compressors (which failed). Can someone help me do better tests?
EDIT: Using tests from the suggested TestU01 suite, the method that I previously used turned out to be worse than xorshift32. I have updated the source code above in case anyone is interested in using the better version.
This is an interesting question. Ultimately the only test that counts is "does this rng yield correct results for the problem I'm working on". What are you hoping to do with the rng?
In order to avoid answering that question for every different problem, various tests have been devised. See for example the "Diehard" tests devised by George Marsaglia. A web search for "marsaglia random number generator tests" turns up several interesting links.
I think Marsaglia's work is a couple of decades old at this point. I don't know if there has been more work on this topic since then. My guess is that for non-cryptographic purposes, an rng which passes the Diehard tests is probably sufficient.
There's a big difference in the requirements for a PRNG for a video game (especially single-player) vs. a monte-carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.
There's a reason that different PRNGs with different speed / quality tradeoffs exist.
This one is very fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.
On x86 with BMI2, each RNG step should only require rorx edx, eax, 3 / crc32 eax, dl. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency. (http://agner.org/optimize/). Or 3 uops without BMI2, for mov edx, eax / shr edx,3 / crc32 eax, dl, but still only 4 cycles of latency on CPUs with zero-latency mov for GP registers: Ivybridge+ and Ryzen.
2 uops is negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycle if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global to after a loop, costing you 2 extra 1-uop instructions).
On Ryzen, crc32 is 3 uops with 3c total latency, so more impact on surrounding code but the same one per 4 clock bottleneck if you're doing so little with the PRNG results that you bottleneck on that.
I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency. (Almost all relevant x86 CPUs are out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.
If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel (in different elements of XMM or YMM vectors). Especially if you can usefully use a __m256i vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.
Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be. i.e. 3 is always followed by 1897987234 (fake numbers), never 3 followed by something else. Most statistical quality tests should pick this up, but this might or might not be a problem for any given use-case.
Note that https://en.wikipedia.org/wiki/Xorshift is saying that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32c is also based on XOR and shift (but also with bit-reflect and modulo in Galois Field(2)), so it's reasonable to think it might be similar or better in quality.
You say your choice of crc32(rnd, rnd>>3) gives a period of 2^32, and that's the best you can do with a state that small. (Of course rnd++ achieves the same period, so it's not the only measure of quality.) It's likely at least as good as an LCG, but those are not considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).
I tested with a bitmap of previously seen values (NASM source), incrementing a counter until we reach a bitmap entry we've already seen. Indeed, all 2^32 values are seen before coming back to the original value. Since the period is exactly 0x100000000, that rules out any special values that are part of a shorter cycle.
Without memory access to the bitmap, Skylake does indeed run the loop at 4 cycles per iteration, as expected from the latency bottleneck. And rorx/crc32 is just 2 total uops.
One measure of the goodness of a PRNG is the length of the cycle. If that matters for your application, then a CRC-32 as you are using it would not be a good choice, since the cycle is only 232. One consequence is that if you are using more samples than that, which doesn't take very long, your results will repeat. Another is that there is a correlation between successive CRC-32 values, where there is only one possible value that will follow the current value.
Better PRNGs have exponentially longer cycles, and the values returned are smaller than the bits in the state, so that successive values do not have that correlation.
You don't need to use a CRC-32C instruction to be fast. Also you don't need to design your own PRNG, which is fraught with hidden perils. Best to leave that to the professionals. See this work for high-quality, small, and fast random number generators.
Related
This question is very similar to:
SIMD instructions for floating point equality comparison (with NaN == NaN)
Although that question focused on 128 bit vectors and had requirements about identifying +0 and -0.
I had a feeling I might be able to get this one myself but the intel intrinsics guide page seems to be down :/
My goal is to take an array of doubles and to return whether a NaN is present in the array. I am expecting that the majority of the time that there won't be one, and would like that route to have the best performance.
Initially I was going to do a comparison of 4 doubles to themselves, mirroring the non-SIMD approach for NaN detection (i.e. NaN only value where a != a is true). Something like:
data *double = ...
__m256d a, b;
int temp = 0;
//This bit would be in a loop over the array
//I'd probably put a sentinel in and loop over while !temp
a = _mm256_loadu_pd(data);
b = _mm256_cmp_pd(a, a, _CMP_NEQ_UQ);
temp = temp | _mm256_movemask_pd(b);
However, in some of the examples of comparison it looks like there is some sort of NaN detection already going on in addition to the comparison itself. I briefly thought, well if something like _CMP_EQ_UQ will detect NaNs, I can just use that and then I can compare 4 doubles to 4 doubles and magically look at 8 doubles at once at the same time.
__m256d a, b, c;
a = _mm256_loadu_pd(data);
b = _mm256_loadu_pd(data+4);
c = _mm256_cmp_pd(a, b, _CMP_EQ_UQ);
At this point I realized I wasn't quite thinking straight because I might happen to compare a number to itself that is not a NaN (i.e. 3 == 3) and get a hit that way.
So my question is, is comparing 4 doubles to themselves (as done above) the best I can do or is there some other better approach to finding out whether my array has a NaN?
You might be able to avoid this entirely by checking fenv status, or if not then cache block it and/or fold it into another pass over the same data, because it's very low computational intensity (work per byte loaded/stored), so it easily bottlenecks on memory bandwidth. See below.
The comparison predicate you're looking for is _CMP_UNORD_Q or _CMP_ORD_Q to tell you that the comparison is unordered or ordered, i.e. that at least one of the operands is a NaN, or that both operands are non-NaN, respectively. What does ordered / unordered comparison mean?
The asm docs for cmppd list the predicates and have equal or better details than the intrinsics guide.
So yes, if you expect NaN to be rare and want to quickly scan through lots of non-NaN values, you can vcmppd two different vectors against each other. If you cared about where the NaN was, you could do extra work to sort that out once you know that there is at least one in either of two input vectors. (Like _mm256_cmp_pd(a,a, _CMP_UNORD_Q) to feed movemask + bitscan for lowest set bit.)
OR or AND multiple compares per movemask
Like with other SSE/AVX search loops, you can also amortize the movemask cost by combining a few compare results with _mm256_or_pd (find any unordered) or _mm256_and_pd (check for all ordered). E.g. check a couple cache lines (4x _mm256d with 2x _mm256_cmp_pd) per movemask / test/branch. (glibc's asm memchr and strlen use this trick.) Again, this optimizes for your common case where you expect no early-outs and have to scan the whole array.
Also remember that it's totally fine to check the same element twice, so your cleanup can be simple: a vector that loads up to the end of the array, potentially overlapping with elements you already checked.
// checks 4 vectors = 16 doubles
// non-zero means there was a NaN somewhere in p[0..15]
static inline
int any_nan_block(double *p) {
__m256d a = _mm256_loadu_pd(p+0);
__m256d abnan = _mm256_cmp_pd(a, _mm256_loadu_pd(p+ 4), _CMP_UNORD_Q);
__m256d c = _mm256_loadu_pd(p+8);
__m256d cdnan = _mm256_cmp_pd(c, _mm256_loadu_pd(p+12), _CMP_UNORD_Q);
__m256d abcdnan = _mm256_or_pd(abnan, cdnan);
return _mm256_movemask_pd(abcdnan);
}
// more aggressive ORing is possible but probably not needed
// especially if you expect any memory bottlenecks.
I wrote the C as if it were assembly, one instruction per source line. (load / memory-source cmppd). These 6 instructions are all single-uop in the fused-domain on modern CPUs, if using non-indexed addressing modes on Intel. test/jnz as a break condition would bring it up to 7 uops.
In a loop, an add reg, 16*8 pointer increment is another 1 uop, and cmp / jne as a loop condition is one more, bringing it up to 9 uops. So unfortunately on Skylake this bottlenecks on the front-end at 4 uops / clock, taking at least 9/4 cycles to issue 1 iteration, not quite saturating the load ports. Zen 2 or Ice Lake could sustain 2 loads per clock without any more unrolling or another level of vorpd combining.
Another trick that might be possible is to use vptest or vtestpd on two vectors to check that they're both non-zero. But I'm not sure it's possible to correctly check that every element of both vectors is non-zero. Can PTEST be used to test if two registers are both zero or some other condition? shows that the other way (that _CMP_UNORD_Q inputs are both all-zero) is not possible.
But this wouldn't really help: vtestpd / jcc is 3 uops total, vs. vorpd / vmovmskpd / test+jcc also being 3 fused-domain uops on existing Intel/AMD CPUs with AVX, so it's not even a win for throughput when you're branching on the result. So even if it's possible, it's probably break even, although it might save a bit of code size. And wouldn't be worth considering if it takes more than one branch to sort out the all-zeros or mix_zeros_and_ones cases from the all-ones case.
Avoiding work: check fenv flags instead
If your array was the result of computation in this thread, just check the FP exception sticky flags (in MXCSR manually, or via fenv.h fegetexcept) to see if an FP "invalid" exception has happened since you last cleared FP exceptions. If not, I think that means the FPU hasn't produced any NaN outputs and thus there are none in arrays written since then by this thread.
If it is set, you'll have to check; the invalid exception might have been raised for a temporary result that didn't propagate into this array.
Cache blocking:
If/when fenv flags don't let you avoid the work entirely, or aren't a good strategy for your program, try to fold this check into whatever produced the array, or into the next pass that reads it. So you're reusing data while it's already loaded into vector registers, increasing computational intensity. (ALU work per load/store.)
Even if data is already hot in L1d, it will still bottleneck on load port bandwidth: 2 loads per cmppd still bottlenecks on 2/clock load port bandwidth, on CPUs with 2/clock vcmppd ymm (Skylake but not Haswell).
Also worthwhile to align your pointers to make sure you're getting full load throughput from L1d cache, especially if data is sometimes already hot in L1d.
Or at least cache-block it so you check a 128kiB block before running another loop on that same block while it's hot in cache. That's half the size of 256k L2 so your data should still be hot from the previous pass, and/or hot for the next pass.
Definitely avoid running this over a whole multi-megabyte array and paying the cost of getting it into the CPU core from DRAM or L3 cache, then evicting again before another loop reads it. That's worst case computational intensity, paying the cost of getting it into a CPU core's private cache more than once.
I am evaluating a network+rendering workload for my project.
The program continuously runs a main loop:
while (true) {
doSomething()
drawSomething()
doSomething2()
sendSomething()
}
The main loop runs more than 60 times per second.
I want to see the performance breakdown, how much time each procedure takes.
My concern is that if I print the time interval for every entrance and exit of each procedure,
It would incur huge performance overhead.
I am curious what is an idiomatic way of measuring the performance.
Printing of logging is good enough?
Generally: For repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; easy to distort results unless you understand the implications of doing that; for very short things, throughput and latency are different, so measure both separately by making one iteration use the result of the previous or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program.
e.g. loop unrolling and lookup tables often look good because there's no pressure on I-cache or D-cache from anything else.)
Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
This question is way too broad to say anything more specific.
Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. And runs it for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
Beware common microbenchmark pitfalls
Failure to warm up code / data caches and stuff: page faults within the timed region for touching new memory, or code / data cache misses, that wouldn't be part of normal operation. (Example of noticing this effect: Performance: memset; or example of a wrong conclusion based on this mistake)
Never-written memory (obtained fresh from the kernel) gets all its pages copy-on-write mapped to the same system-wide physical page (4K or 2M) of zeros if you read without writing, at least on Linux. So you can get cache hits but TLB misses. e.g. A large allocation from new / calloc / malloc, or a zero-initialized array in static storage in .bss. Use a non-zero initializer or memset.
Failure to give the CPU time to ramp up to max turbo: modern CPUs clock down to idle speeds to save power, only clocking up after a few milliseconds. (Or longer depending on the OS / HW).
related: on modern x86, RDTSC counts reference cycles, not core clock cycles, so it's subject to the same CPU-frequency variation effects as wall-clock time.
Most integer and FP arithmetic asm instructions (except divide and square root which are already slower than others) have performance (latency and throughput) that doesn't depend on the actual data. Except for subnormal aka denormal floating point being very slow, and in some cases (e.g. legacy x87 but not SSE2) also producing NaN or Inf can be slow.
On modern CPUs with out-of-order execution, some things are too short to truly time meaningfully, see also this. Performance of a tiny block of assembly language (e.g. generated by a compiler for one function) can't be characterized by a single number, even if it doesn't branch or access memory (so no chance of mispredict or cache miss). It has latency from inputs to outputs, but different throughput if run repeatedly with independent inputs is higher. e.g. an add instruction on a Skylake CPU has 4/clock throughput, but 1 cycle latency. So dummy = foo(x) can be 4x faster than x = foo(x); in a loop. Floating-point instructions have higher latency than integer, so it's often a bigger deal. Memory access is also pipelined on most CPUs, so looping over an array (address for next load easy to calculate) is often much faster than walking a linked list (address for next load isn't available until the previous load completes).
Obviously performance can differ between CPUs; in the big picture usually it's rare for version A to be faster on Intel, version B to be faster on AMD, but that can easily happen in the small scale. When reporting / recording benchmark numbers, always note what CPU you tested on.
Related to the above and below points: you can't "benchmark the * operator" in C in general, for example. Some use-cases for it will compile very differently from others, e.g. tmp = foo * i; in a loop can often turn into tmp += foo (strength reduction), or if the multiplier is a constant power of 2 the compiler will just use a shift. The same operator in the source can compile to very different instructions, depending on surrounding code.
You need to compile with optimization enabled, but you also need to stop the compiler from optimizing away the work, or hoisting it out of a loop. Make sure you use the result (e.g. print it or store it to a volatile) so the compiler has to produce it. For an array, volatile double sink = output[argc]; is a useful trick: the compiler doesn't know the value of argc so it has to generate the whole array, but you don't need to read the whole array or even call an RNG function. (Unless the compiler aggressively transforms to only calculate the one output selected by argc, but that tends not to be a problem in practice.)
For inputs, use a random number or argc or something instead of a compile-time constant so your compiler can't do constant-propagation for things that won't be constants in your real use-case. In C you can sometimes use inline asm or volatile for this, e.g. the stuff this question is asking about. A good benchmarking package like Google Benchmark will include functions for this.
If the real use-case for a function lets it inline into callers where some inputs are constant, or the operations can be optimized into other work, it's not very useful to benchmark it on its own.
Big complicated functions with special handling for lots of special cases can look fast in a microbenchmark when you run them repeatedly, especially with the same input every time. In real life use-cases, branch prediction often won't be primed for that function with that input. Also, a massively unrolled loop can look good in a microbenchmark, but in real life it slows everything else down with its big instruction-cache footprint leading to eviction of other code.
Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
Litmus tests:
If you're benchmarking two functions in one program: if reversing the order of testing changes the results, your benchmark isn't fair. e.g. function A might only look slow because you're testing it first, with insufficient warm-up. example: Why is std::vector slower than an array? (it's not, whichever loop runs first has to pay for all the page faults and cache misses; the 2nd just zooms through filling the same memory.)
Increasing the iteration count of a repeat loop should linearly increase the total time, and not affect the calculated time-per-call. If not, then you have non-negligible measurement overhead or your code optimized away (e.g. hoisted out of the loop and runs only once instead of N times).
Vary other test parameters as a sanity check.
For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound where I went into some more detail about microbenchmarking and using volatile or asm to stop important work from optimizing away with gcc/clang.
I have a C program that has n multiplications (single multiplication with n iterations) and I found another logic that has n/2 iterations of (1 multiplication + 2 additions). I know about the complexity that both are of O(n). but in terms of CPU cycles. which is faster ?
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication might be very cheap, on some newish Intel processors it's 3 clock cycles. Additions are 1 cycle on these same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
First of all follow Dietrich Epp's first advice - measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: Latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a
dependency chain. The numbers are minimum values. Cache misses,
misalignment, and exceptions may increase the clock counts
considerably. Where hyperthreading is enabled, the use of the same
execution units in the other thread leads to inferior performance.
Denormal numbers, NAN’s and infinity do not increase the latency. The
time unit used is core clock cycles, not the reference clock cycles
given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per
instruction for a series of independent instructions of the same kind
in the same thread.
For Sandy bridge the rec. throughput for an add r, r/i (for further notice r=register, i=immediate, m=memory) is 0.33 while the latency is 1.
An imul r, r has a latency of 3 and a rec. throughput of 1.
So as you see it completely depends on your specific algorithm - if you can just replace one imul with two independent adds this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand if your adds add a problematic dependency one imul could be just as fast as one add.
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).
Is there a way using C or assembler or maybe even C# to get an accurate measure of how long it takes to execute a ADD instruction?
Yes, sort of, but it's non-trivial and produces results that are almost meaningless, at least on most reasonably modern processors.
On relatively slow processors (e.g., up through the original Pentium in the Intel line, still true on most small embedded processors) you can just look in the processor's data sheet and it'll (normally) tell you how many clock ticks to expect. Quick, simple, and easy.
On a modern desktop machine (e.g., Pentium Pro or newer), life isn't nearly that simple. These CPUs can execute a number of instructions at a time, and execute them out of order as long as there aren't any dependencies between them. This means the whole concept of the time taken by a single instruction becomes almost meaningless. The time taken to execute one instruction can and will depend on the instructions that surround it.
That said, yes, if you really want to, you can (usually -- depending on the processor) measure something, though it's open to considerable question exactly how much it'll really mean. Even getting a result like this that's only close to meaningless instead of completely meaningless isn't trivial though. For example, on an Intel or AMD chip, you can use RDTSC to do the timing measurement itself. That, unfortunately, can be executed out of order as described above. To get meaningful results, you need to surround it by an instruction that can't be executed out of order (a "serializing instruction"). The most common choice for that is CPUID, since it's one of the few serializing instructions that's available to "user mode" (i.e., ring 3) programs. That adds a bit of a twist itself though: as documented by Intel, the first few times the processor executes CPUID, it can take longer than subsequent times. As such, they recommend that you execute it three times before you use it to serialize your timing. Therefore, the general sequence runs something like this:
.align 16
CPUID
CPUID
CPUID
RDTSC
; sequence under test
Add eax, ebx
; end of sequence under test
CPUID
RDTSC
Then you compare that to a result from doing the same, but with the sequence under test removed. That's leaving out quite a fe details, of course -- at minimum you need to:
set the registers up correctly before each CPUID
save the value in EAX:EDX after the first RDTSC
subtract result from the second RDTSC from the first
Also note the "align" directive I've inserted -- instruction alignment can and will affect timing as well, especially if a loop is involved.
Construct a loop that executes 10 million times, with nothing in the loop body, and time that. Keep that time as the overhead required for looping.
Then execute the same loop again, this time with the code under test in the body. Time for this loop, minus the overhead (from the empty loop case) is the time due to the 10 million repetitions of your code under test. So, divide by the number of iterations.
Obviously this method needs tuning with regard to the number of iterations. If what you're measuring is small, like a single instruction, you might even want to run upwards of a billion iterations. If its a significant chunk of code, a few 10's of thousands might suffice.
In the case of a single assembly instruction, the assembler is probably the right tool for the job, or perhaps C, if you are conversant with inline assembly. Others have posted more elegant solutions for how to get a measurement w/o the repetition, but the repetition technique is always available, for example, an embedded processor that doesn't have the nice timing instructions mentioned by others.
Note however, that on modern pipeline processors, instruction level parallelism may confound your results. Because more than one instruction is running through the execution pipeline at a time, it is no longer true that N repetitions of an given instruction take N times as long as a single one.
Okay, the problem that you are going to encounter if you are using an OS like Windows, Linux, Unix, MacOS, AmigaOS and all those others that there are lots of processes already running on your machine in the background which will impact performance. The only real way of calculating actual time of an instruction is to disassemble your motherboard and test each component using external hardware. It depends whether you absolutely want to do this yourself, or simply find out how fast a typical revision of your processor actually runs. Companies such as Intel and Motorola test their chips extensively before release, and these results are available to the public. All you need to do is ask them and they'll send you a free CD-ROM (it might be a DVD - nonsense pedantry) with the results contained. You can do it yourself, but be warned that especially Intel processors contain many redundant instructions that are no longer desirable, let alone necessary. This will take up a lot of your time, but I can absolutely see the fun in doing this. PS. If its purely to help push your own machine's hardware to its theoretical maximum in a personal project that you're doing the Just Jeff's answer above is excellent for generating tidy instruction-speed-averages under real-world conditions.
No, but you can calculate it based upon the number of clock cycles the add instruction requires multiplied by the clock rate of the CPU. Different types of arguments to an ADD may result in more or fewer cycles but, for a given argument list, the instruction always takes the same number of cycles to complete.
That said, why do you care?
In trying to figure out whether or not my code's inner loop is hitting a hardware design barrier or a lack of understanding on my part barrier. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for(i=0;i<32768-1;i++) {
xref=px[i];
yref=py[i];
zref=pz[i];
for(j=0;j<32768-1;j++ {
deltx=xref-px[j];
delty=yref-py[j];
deltz=zref-pz[j];
} }
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPU's can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
Consider: How wide is a float? How wide is the SSEx instruction? The ratio should should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havok with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.