I have a C program that does n multiplications (a single multiplication per iteration, over n iterations), and I found another formulation that does n/2 iterations of (1 multiplication + 2 additions). I know that in terms of complexity both are O(n), but in terms of CPU cycles, which is faster?
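For concreteness, the two shapes being compared look roughly like the sketch below. The function names and loop bodies are invented for illustration (the asker's actual code isn't shown), and the two functions don't compute the same thing; they only illustrate the per-iteration instruction mix.

/* Shape A: n iterations, one multiplication per iteration. */
void shape_a(int *out, const int *a, const int *b, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = a[i] * b[i];                        /* 1 mul per iteration */
}

/* Shape B: n/2 iterations, one multiplication plus two additions per iteration. */
void shape_b(int *out, const int *a, const int *b, int n)
{
    for (int i = 0; i + 1 < n; i += 2)
        out[i] = a[i] * b[i] + a[i + 1] + b[i + 1];  /* 1 mul + 2 adds per iteration */
}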
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication might be very cheap. On some newish Intel processors it's 3 clock cycles, and additions are 1 cycle on those same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
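A minimal way to "test on your computer" is to wall-clock each variant over many repetitions. A sketch (the dummy work inside run_candidate is a placeholder for whichever of the two loops you want to measure):

#include <stdio.h>
#include <time.h>

static volatile int sink;             /* keeps the compiler from deleting the work */

static void run_candidate(void)       /* placeholder: put either loop variant here */
{
    int acc = 0;
    for (int i = 0; i < 100000; i++)
        acc += i * 3;
    sink = acc;
}

int main(void)
{
    clock_t t0 = clock();
    for (int rep = 0; rep < 1000; rep++)
        run_candidate();
    clock_t t1 = clock();
    printf("%.3f s total\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}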
First of all follow Dietrich Epp's first advice - measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: Latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN's and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
For Sandy Bridge the rec. throughput for an add r, r/i (from here on r=register, i=immediate, m=memory) is 0.33 while the latency is 1.
An imul r, r has a latency of 3 and a rec. throughput of 1.
So as you see it completely depends on your specific algorithm - if you can just replace one imul with two independent adds, this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand, if your adds introduce a problematic dependency, one imul could be just as fast as one add.
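To make the dependency point concrete, here is a sketch of the two extremes (summing an array, purely for illustration): in the first loop every add feeds the next one, so it runs at roughly the add latency per element; in the second, four independent accumulators let the adds overlap, so it is limited by add throughput instead.

/* Latency-bound: each addition depends on the previous sum. */
long chained_sum(const long *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                        /* s depends on the previous s */
    return s;
}

/* Throughput-bound: four independent accumulators can execute in parallel. */
long unrolled_sum(const long *a, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i + 3 < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return s0 + s1 + s2 + s3;             /* leftover elements omitted; sketch only */
}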
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyway, if you want a good listing of this stuff, see Agner Fog's instruction tables at http://agner.org/optimize/ (the above description of latency/rec. throughput is also from that particular document).
I recently discovered that the _mm_crc32_* Intel intrinsics could be used to generate (pseudo-)random 32-bit numbers.
#include <stdio.h>
#include <stdint.h>
#include <nmmintrin.h> /* needs CRC32C instruction from SSE4.2 instruction set extension */

int main(void)
{
    uint32_t rnd = 1; /* initialize with seed != 0 */
    /* period length is 4,294,967,295 = 2^32-1 */
    while (1) {
#if 0   /* this was faster but worse than xorshift32 (fails more tests) */
        rnd = _mm_crc32_u8(rnd, rnd >> 3);
#else   /* this is faster and better than xorshift32 (fails fewer tests) */
        rnd = _mm_crc32_u32(rnd, rnd << 18);
#endif
        printf("%08X\n", rnd);
    }
}
This method is as fast as an LCG, and faster than xorshift32. Wikipedia says that because xorshift generators "fail a few statistical tests, they have been accused of being unreliable".
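For reference, the xorshift32 being compared against is presumably something like the textbook 13/17/5 version; a sketch (the asker's exact variant isn't shown):

#include <stdint.h>

/* Textbook xorshift32 (Marsaglia's 13/17/5 constants); the state must be non-zero. */
static uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}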
Now I am wondering whether the CRC32C method passes the various tests that are run on random number generators. So far I have only verified that each bit, even the LSB, is "random" by trying to compress the output with PAQ8 compressors (which failed to compress it). Can someone help me run better tests?
EDIT: Using tests from the suggested TestU01 suite, the method that I previously used turned out to be worse than xorshift32. I have updated the source code above in case anyone is interested in using the better version.
This is an interesting question. Ultimately the only test that counts is "does this rng yield correct results for the problem I'm working on". What are you hoping to do with the rng?
In order to avoid answering that question for every different problem, various tests have been devised. See for example the "Diehard" tests devised by George Marsaglia. A web search for "marsaglia random number generator tests" turns up several interesting links.
I think Marsaglia's work is a couple of decades old at this point. I don't know if there has been more work on this topic since then. My guess is that for non-cryptographic purposes, an rng which passes the Diehard tests is probably sufficient.
There's a big difference in the requirements for a PRNG for a video game (especially single-player) vs. a monte-carlo simulation. Small biases can be a problem for scientific numerical computing, but generally not for a game, especially if numbers from the same PRNG are used in different ways.
There's a reason that different PRNGs with different speed / quality tradeoffs exist.
This one is very fast, especially if the seed / state stays in a register, taking only 2 or 3 uops on a modern Intel CPU. So it's fantastic if it can inline into a loop. Compared to anything else of the same speed, it's probably better quality. But compared to something only a little bit slower with larger state, it's probably pathetic if you care about statistical quality.
On x86 with BMI2, each RNG step should only require rorx edx, eax, 3 / crc32 eax, dl. On Haswell/Skylake, that's 2 uops with total latency = 1 + 3 cycles for the loop-carried dependency. (http://agner.org/optimize/). Or 3 uops without BMI2, for mov edx, eax / shr edx,3 / crc32 eax, dl, but still only 4 cycles of latency on CPUs with zero-latency mov for GP registers: Ivybridge+ and Ryzen.
2 uops has a negligible impact on surrounding code in the normal case where you do enough work with each PRNG result that the 4-cycle dependency chain isn't a bottleneck. (Or ~9 cycles if your compiler stores/reloads the PRNG state inside the loop instead of keeping it live in a register and sinking the store to the global until after the loop, costing you 2 extra 1-uop instructions.)
On Ryzen, crc32 is 3 uops with 3c total latency, so more impact on surrounding code but the same one per 4 clock bottleneck if you're doing so little with the PRNG results that you bottleneck on that.
I suspect you may have been benchmarking the loop-carried dependency chain bottleneck, not the impact on real surrounding code that does enough work to hide that latency. (Almost all relevant x86 CPUs are out-of-order execution.) Making an RNG even cheaper than xorshift128+, or even xorshift128, is likely to be of negligible benefit for most use-cases. xorshift128+ or xorshift128* is fast, and quite good quality for the speed.
If you want lots of PRNG results very quickly, consider using a SIMD xorshift128+ to run two or four generators in parallel (in different elements of XMM or YMM vectors). Especially if you can usefully use a __m256i vector of PRNG results. See AVX/SSE version of xorshift128+, and also this answer where I used it.
Returning the entire state as the RNG result is usually a bad thing, because it means that one value tells you exactly what the next one will be. i.e. 3 is always followed by 1897987234 (fake numbers), never 3 followed by something else. Most statistical quality tests should pick this up, but this might or might not be a problem for any given use-case.
Note that https://en.wikipedia.org/wiki/Xorshift is saying that even xorshift128 fails a few statistical tests. I assume xorshift32 is significantly worse. CRC32c is also based on XOR and shift (but also with bit-reflect and modulo in Galois Field(2)), so it's reasonable to think it might be similar or better in quality.
You say your choice of crc32(rnd, rnd>>3) gives a period of 2^32, and that's the best you can do with a state that small. (Of course rnd++ achieves the same period, so it's not the only measure of quality.) It's likely at least as good as an LCG, but those are not considered high quality, especially if the modulus is 2^32 (so you get it for free from fixed-width integer math).
I tested with a bitmap of previously seen values (NASM source), incrementing a counter until we reach a bitmap entry we've already seen. Indeed, all 2^32 values are seen before coming back to the original value. Since the period is exactly 0x100000000, that rules out any special values that are part of a shorter cycle.
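The actual test was written in NASM (linked above); a rough C equivalent of the same idea, using a 512 MiB bitmap of previously seen states for the crc32(rnd, rnd >> 3) variant discussed here, might look like this (compile with something like gcc -O2 -msse4.2):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <nmmintrin.h>

int main(void)
{
    uint8_t *seen = calloc(1u << 29, 1);   /* 2^32 bits = 512 MiB bitmap */
    if (!seen) return 1;

    uint32_t rnd = 1;                      /* seed != 0 */
    uint64_t steps = 0;
    do {
        seen[rnd >> 3] |= (uint8_t)(1u << (rnd & 7));      /* mark current state */
        rnd = _mm_crc32_u8(rnd, (uint8_t)(rnd >> 3));      /* advance the generator */
        steps++;
    } while (!(seen[rnd >> 3] & (1u << (rnd & 7))));       /* stop at first repeat */

    printf("cycle length: %llu\n", (unsigned long long)steps);
    free(seen);
    return 0;
}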
Without memory access to the bitmap, Skylake does indeed run the loop at 4 cycles per iteration, as expected from the latency bottleneck. And rorx/crc32 is just 2 total uops.
One measure of the goodness of a PRNG is the length of the cycle. If that matters for your application, then a CRC-32 as you are using it would not be a good choice, since the cycle is only 2^32. One consequence is that if you are using more samples than that, which doesn't take very long, your results will repeat. Another is that there is a correlation between successive CRC-32 values, where there is only one possible value that will follow the current value.
Better PRNGs have exponentially longer cycles, and the values returned are smaller than the bits in the state, so that successive values do not have that correlation.
You don't need to use a CRC-32C instruction to be fast. Also you don't need to design your own PRNG, which is fraught with hidden perils. Best to leave that to the professionals. See this work for high-quality, small, and fast random number generators.
We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented all these algorithms we want to know how much more performant they are compared to the usual matrix multiplication. We have already implemented code to measure the running time of all these algorithms, but now we have decided that we also want to know how many floating-point operations per second (FLOPS) we can perform.
Any suggestion of how to measure/count this?
Here some background information on the used system:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was now to implement a kind of FPO counter which we increment after each floating point operation (Arithmetic operations as well as comparison and move operations), but that would mean that we have to insert increment operations all over our code which also slows down the application ...
Does anyone know if there is some kind of hardware counter which counts the number of floating point operations or maybe if there exist some kind of performance tool which can be used to monitor our program and measures the number of FPOs.
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPs for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that we calculated the number of floating-point operations per second.
It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, then the CPU needs to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
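As a sketch of plumbing that formula into a benchmarking tool (for the naïve dense case only; other algorithms need their own count):

/* Naive dense multiply of an m x n matrix by an n x m matrix:
   each of the m*m output elements costs n multiplies and n-1 adds. */
double naive_matmul_flops(long m, long n, double seconds)
{
    double ops = (double)m * (double)m * (2.0 * (double)n - 1.0);
    return ops / seconds;                  /* floating-point operations per second */
}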
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
Number of cores x average frequency x operations per cycle.
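For example (numbers invented purely for illustration): a hypothetical 4-core CPU at 2 GHz retiring 4 single-precision operations per core per cycle peaks at 4 x 2e9 x 4 = 32 GFLOPS; your counted operations divided by the measured run time shows how close you actually get to that ceiling.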
I'm using an OMAP L138 processor at the moment, which does not have a hardware FPU. We will be processing spectral data using algorithms that are FP-intensive, so the ARM side won't be adequate. I'm not the algorithm person, but one of them is "Dynamic Time Warping" (I don't know what that means). The initial performance numbers are:
Core i7 Laptop @ 2.9GHz: 1 second
Raspberry Pi ARM1176 @ 700MHz: 12 seconds
OMAP L138 ARM926 @ 300MHz: 193 seconds
Worse, the Pi is about 30% of the price of the board I'm using!
I do have a TI C674x which is the other processor in the OMAP L138. The question is would I be best served by spending many weeks trying to:
learn the DSPLINK, interop libraries and toolchain not to mention forking out for the large cost of Code Composer or
throwing the L138 out and moving to a Dual Cortex A9 like the Pandaboard, possibly suffering power penalties in the process.
(When I look at FPU performance on the A8, it isn't an improvement over the Rasp Pi but Cortex A9 seems to be).
I understand the answer is "it depends". Others here have said that "you unlock an incredible fast DSP that can easily outperform the Cortex-A8 if assigned the right job" but for a defined job set would I be better off skipping to the A9, even if I had to buy an external DSP later?
That question can't be answered without knowing the clock-rates of DSP and the ARM.
Here is some background:
I just checked the cycles of a floating point multiplication on the c674x DSP:
It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result appears in the destination register).
You can however start two multiplications each cycle because the DSP will not wait for the result. The compiler/assembler will do the required scheduling for you.
That only uses two of the available eight functional units of the DSP, so while you do the two multiplications you can per cycle also do:
two load/stores (64 bit wide)
four floating point add/subtract instructions (or integer instructions)
Loop control and branching is free and does not cost you anything on the DSP.
That makes a total of six floating point operations per cycle with parallel loads/stores and loop control.
ARM-NEON on the other hand can, in floating point mode:
Issue two multiplications per cycle. Latency is comparable, and the instructions are also pipeline-able like on the DSP. Loading/storing takes extra time as does add/subtract stuff. Loop control and branching will very likely go for free in well written code.
So in summary the DSP does three times as much work per cycle as the Cortex-A9 NEON unit.
Now you can check the clock-rates of DSP and the ARM and see what is faster for your job.
Oh, one thing: With well-written DSP code you will almost never see a cache miss during loads because you move the data from RAM to the cache using DMA before you access the data. This gives impressive speed advantages for the DSP as well.
It does depend on the application but, generally speaking, it is rare these days for special-purpose processors to beat general-purpose processors. General-purpose processors now have higher clock rates and multimedia acceleration. Even for a numerically intensive algorithm where a DSP may have an edge, the increased engineering complexity of dealing with a heterogeneous multi-processor environment makes this type of solution problematic from an ROI perspective.
We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture, e.g. a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.
I have selected a typical representative of the modern desktop CPUs - Quad Core, about 10Mb cache, 3GHz, 45nm. Can you please help me to find out its limits:
1) The highest possible number of multiply-accumulate operations per second, if CPU-specific instructions that GCC supports via input flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture like Altivec on PowerPC - the best option is to use GCC flags like -msse or -maltivec. I assume the program also has to have 4 threads in order to utilize all available cores, right?
2) SDRAM-CPU bandwidth (highest limit, so indep. on the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been added since GCC 4. SSE4.1 introduces DPPS, DPPD instructions - Dot product for Array of Structs data. New 45nm Intel processors support SSE4 instructions.
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
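As a rough illustration of the "one vector add and one vector multiply per cycle" point, an SSE multiply-accumulate kernel might look like the sketch below (the function name is invented; it assumes 16-byte-aligned arrays whose length is a multiple of 4):

#include <stddef.h>
#include <xmmintrin.h>                     /* SSE intrinsics */

/* acc[i] += a[i] * b[i], four floats at a time: one MULPS and one ADDPS per step.
   Assumes 16-byte-aligned pointers and n being a multiple of 4. */
void vec_madd(float *acc, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va   = _mm_load_ps(a + i);
        __m128 vb   = _mm_load_ps(b + i);
        __m128 vacc = _mm_load_ps(acc + i);
        _mm_store_ps(acc + i, _mm_add_ps(vacc, _mm_mul_ps(va, vb)));
    }
}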
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (when used on AMD's CPUs you have to patch away some CPU checks Intel puts in to prevent AMD from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture "Bulldozer" will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, but on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.
I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier created by my own lack of understanding. There's a bit more to it, but the simplest question I can come up with is as follows:
If I have the following code:
float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltx, delty, deltz;
int i, j;

initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);

for (i = 0; i < 32768-1; i++) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for (j = 0; j < 32768-1; j++) {
        deltx = xref - px[j];
        delty = yref - py[j];
        deltz = zref - pz[j];
    }
}
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPUs can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
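Concretely, since the data is already laid out as separate px/py/pz arrays, the inner loop can be vectorized across four consecutive j values; a sketch with SSE intrinsics (remainder handling and whatever the real code does with the deltas are omitted):

/* Inside the i loop, after xref/yref/zref are set; needs #include <xmmintrin.h>. */
__m128 vxref = _mm_set1_ps(xref);
__m128 vyref = _mm_set1_ps(yref);
__m128 vzref = _mm_set1_ps(zref);
for (j = 0; j + 4 <= 32768-1; j += 4) {
    __m128 dx = _mm_sub_ps(vxref, _mm_loadu_ps(&px[j]));
    __m128 dy = _mm_sub_ps(vyref, _mm_loadu_ps(&py[j]));
    __m128 dz = _mm_sub_ps(vzref, _mm_loadu_ps(&pz[j]));
    /* ...use dx/dy/dz here as the real code requires... */
}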
Consider: How wide is a float? How wide is the SSEx instruction? The ratio should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havoc with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.
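A sketch of what tiling the j loop might look like here (TILE is a made-up block size; three blocks of 1024 floats are about 12 KB, which fits comfortably in a typical 32 KB L1 data cache, but the value would need tuning):

#define TILE 1024

for (int jj = 0; jj < 32768-1; jj += TILE) {
    int jend = (jj + TILE < 32768-1) ? jj + TILE : 32768-1;
    for (int i = 0; i < 32768-1; i++) {
        float xr = px[i], yr = py[i], zr = pz[i];
        for (int j = jj; j < jend; j++) {   /* this block of px/py/pz stays hot in L1 */
            float dx = xr - px[j];
            float dy = yr - py[j];
            float dz = zr - pz[j];
            /* ...use dx/dy/dz... */
        }
    }
}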