Calculating Floating point Operations Per Second(FLOPS) and Integer Operations Per Second(IOPS) - benchmarking

I am trying to learn some basic benchmarking. I have a loop in my Java program like,
float a=6.5f;
int b=3;
for(long j=0; j<999999999; j++){
var = a*b+(a/b);
}//end of for
My processor takes around 0.431635 second to process this. How would I calculate processor speed in terms of Flops(Floating point Operations Per Second) and Iops(Integer Operations Per Second)? Can you provide explanations with some steps?

You have a single loop with 999999999 iterations: lets call this 1e9 (one billion) for simplicity. The integers will get promoted to floats in the calculations that involve both, so the loop contains 3 floating-point operations: one mult, one add, and one div, so there are 3e9. This takes 0.432s, so you're apparently getting about 6.94 GFLOP/s (3e9/0.432). Similarly, you are doing 1 integer op (j++) per loop iteration, so you are getting 1e9/0.432 or about 2.32 GIOP/s.
However, the calculation a*b+(a/b) is loop-invariant, so it would be pretty surprising if this didn't get optimized away. I don't know much about Java, but any C compiler will evaluate this at compile-time, remove the a and b variables and the loop, and (effectively) replace the whole lot with var=21.667;. This is a very basic optimization, so I'd be surprised if javac didn't do it too.
I have no idea what's going on under the hood in Java, but I'd be suspicious of getting 7 GFLOPs. Modern Intel CPUs (I'm assuming that's what you've got) are, in principle, capable of two vector arithmetic ops per clock cycle with the right instruction mix (one add and one mult per cycle), so for a 3 GHz 4-core CPU, it's even possible to get 3e9*4*8 = 96 single-precision GFLOPs under ideal conditions. The various mul and add instructions have a reciprocal throughput of 1 cycle, but the div takes more than ten times as long, so I'd be very suspicious of getting more than about CLK/12 FLOPs (scalar division on a single core) once division is involved: if the compiler is smart enough to vectorize and/or parallelize the code to get more than that, which it would have to do, it would surely be smart enough to optimize away the whole loop.
In summary, I suspect that the loop is being optimized away completely and the 0.432 seconds you're seeing is just overhead. You have not given any indication how you're timing the above loop, so I can't be sure. You can check this out for yourself by replacing the ~1e9 loop iterations with 1e10. If it doesn't take about 10x as long, you're not timing what you think you're timing.
There's a lot more to say about benchmarking and profiling, but I'll leave it at that.
I know this is very late, but I hope it helps someone.
Emmet.

Related

What is 'differential timing' technique for doing benchmarks?

At around 39 minute of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube)
there is a slide of how to use differential timing... can someone show me some basic code with this approach? It was only mentioned for a second, but I think that's an interesting idea.
Run baseline 2n times (t2a)
vs. baseline n times + contender n times (ta+b).
Relative improvement = "t2a / (2ta+b - t2a)"
some overhead noises canceled
Alexsandrescu's slide is rather trivial to pour into code:
auto start = clock::now();
for( int i = 0; i < 2*n; i++ )
baseline();
auto t2a = clock::now() - start;
start = clock::now();
for( int i = 0; i < n; i++ )
baseline();
// *
for( int i = 0; i < n; i++ )
contender();
auto taplusb = clock::now() - start;
double r = t2a / (2 * taplusb - t2a) // relative speedup
* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speed up this way as opposed to simply tBaseline / tContender as I've been doing for ever. He only vaguely hints at '...overhead noise (being) cancelled (out)', but doesn't explain it in detail.
If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing that 2n baseline vs. n baseline + n contender in separate invocations of the benchmark program.
So instead of doing some warmup runs before the timed region, he's using the baseline as a controlled warmup inside the timed region. This might make it possible to just time the whole program, e.g. perf stat, instead of calling a time function inside the program. Depending on how much process startup overhead your OS has vs. how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as the timed region will warm up branch-prediction for it. Or not because after inlining the warm-up vs. main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
Take the rest of his advice with a grain of salt, too. Most of it use useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays for cache-footprint reasons), but the reasoning is wrong for some of them.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit int in a vector for paddd or 2x 64-bit int in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64,r64. I don't think it was true even on Pentium 4 (Nocona) despite P4 having funky double-pumped ALUs with 0.5 cycle latency for add. At least before Prescott/Nocona which introduced 64-bit support.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. Gcc/clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements, when processing an array of int8_t or uint8_t. I forget if I've every actually worked around that by using narrow locals, though. C default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That's never been true on x86-64: FP math uses SSE/SSE2 which has an instruction that does truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE includes cvttsd2si exactly for that common case.
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively doing (2 * baseline) / (2*contender) = baseline/contender.
It assumes that the times add normally (not overlapping computation). t_2a = 2 * baseline, and 2 * t_ab = 2 * baseline + 2 * contender. Subtracting cancels the 2*baseline parts, leaving you with 2*contender.
The trick isn't in the math, if anything this is more mathematically dangerous because subtracting two larger numbers accumulates error. i.e. if the n*baseline actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.

Calculating FLops

I am writing a program to calculate the duration that my CPU take to do one "FLops". For that I wrote the code below
before = clock();
y= 4.8;
x= 2.3;
z= 0;
for (i = 0; i < MAX; ++i){
z=x*y+z;
}
printf("%1.20f\n", ( (clock()-before )/CLOCKS_PER_SEC )/MAX);
The problem that I am repeating the same operation. Doesn't the compiler optimize this sort of "Thing"? If so what I have to do to get the correct results?
I am not using the "rand" function so it does not conflict my result.
This has a loop-carried dependency and not enough stuff to do in parallel, so if anything is even executed at all, it would not be FLOPs that you're measuring, with this you will probably measure the latency of floating point addition. The loop carried dependency chain serializes all those additions. That chain has some little side-chains with multiplications in them, but they don't depend on anything so only their throughput matters. But that throughput is going to be better than the latency of an addition on any reasonable processor.
To actually measure FLOPs, there is no single recipe. The optimal conditions depend strongly on the microarchitecture. The number of independent dependency chains you need, the optimal add/mul ratio, whether you should use FMA, it all depends. Typically you have to do something more complicated than what you wrote, and if you're set on using a high level language, you have to somehow trick it into actually doing anything at all.
For inspiration see how do I achieve the theoretical maximum of 4 FLOPs per cycle?
Even if you have no compiler optimization going on (possibilities have already been nicely listed), your variables and result will be in cache after the first loop iteration and from then on your on the track with way more speed and performance than you would be, if the program would have to fetch new values for each iteration.
So if you want to calculate the time for a single flop for a single iteration of this program you would actually have to give new input for every iteration. Really consider using rand() and just seed with a known value srand(1) or so.
Your calculations should also be different; flops are the number of computations your program does so in your case 2*n (where n = MAX). To calculate the amount of time per flop divide time used by the amount of flops.

Determine FLOPS of our ASM program

We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed row format (CSR). Now that we have implemented all these algorithms we want to know how much more performant they are in contrast to the usual matrix multiplication. We already implemented code to measure the running time of all these algorithms but now we decided that we also want to know how many floating points operations per seconds (FLOPS) we can perform.
Any suggestion of how to measure/count this?
Here some background information on the used system:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was now to implement a kind of FPO counter which we increment after each floating point operation (Arithmetic operations as well as comparison and move operations), but that would mean that we have to insert increment operations all over our code which also slows down the application ...
Does anyone know if there is some kind of hardware counter which counts the number of floating point operations or maybe if there exist some kind of performance tool which can be used to monitor our program and measures the number of FPOs.
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPs for a matrix multiplication by using the counting approach. We first measured the running time than inserted counters for each instruction we were interested in and after that we calculated the number of floating point operations per second.
It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, then the CPU needs to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say on some given 100MHz CPU a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS, the latter only 50 MFLOPS. Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.
Number of Cores x Average frequency x Operations percycle

For loop run time

I was having trouble understanding the following concepts of how processor speed affects how long a certain loop runs for.
For a computer with a 3GHz processor, and can do 64-bit arithmetic per cycle, for how long will the following loop run?
long long int x;
for(x = 0 x<=0; x--){}
The compiler may optimize this loop out entirely because it may detect that no result is ever used.
But if the loop is actually compiled, a guess at an upper bound on speed might be two cycles per iteration. Yes, the processor is probably superscaler, so it can sometimes execute more than one instruction in a cycle, but on the other hand one instruction is a branch, which tends to break the pipeline.
So, if we guess two cycles, then it will take about a century to run that loop.
irb> 2**63/(3*10**9)/60/60/24/7/52 # => 97 years
I'm tempted to say that the loop will never finish, as this is much longer than the MTBF for servers, UPS equipment, and power grids, but perhaps you could run it in a VM and checkpoint it periodically. :-)
Of course, there is that Greek fable on the folly of speculation when empirical evidence is available. Why not run the loop for a small amount of time and then calculate the actual result for 263 iterations? The speculation is difficult because few people other than the designers really understand today's complex microarchitectures. There are also many practical problems: does the compiler get to unroll the loop? Perhaps you should just write it in assembly so you can measure something specific?

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

In trying to figure out whether or not my code's inner loop is hitting a hardware design barrier or a lack of understanding on my part barrier. There's a bit more to it, but the simplest question I can come up with to answer is as follows:
If I have the following code:
float px[32768],py[32768],pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for(i=0;i<32768-1;i++) {
xref=px[i];
yref=py[i];
zref=pz[i];
for(j=0;j<32768-1;j++ {
deltx=xref-px[j];
delty=yref-py[j];
deltz=zref-pz[j];
} }
What type of maximum theoretical speed up would I be able to see by going to SSE instructions in a situation where I have complete control over code (assembly, intrinsics, whatever) but no control over runtime environment other than architecture (i.e. it's a multi-user environment so I can't do anything about how the OS kernel assigns time to my particular process).
Right now I'm seeing a speed up of 3x with my code, when I would have thought using SSE would give me much more vector depth than the 3x speed up is indicating (presumably the 3x speed up tells me I have a 4x maximum theoretical throughput). (I've tried things such as letting deltx/delty/deltz be arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only 3x speed up.) I'm using the intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPU's can do at least one floating point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector, and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, then separate it after, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
Consider: How wide is a float? How wide is the SSEx instruction? The ratio should should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipes play havok with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.

Resources