Determine FLOPS of our ASM program - c

We had to implement an ASM program for multiplying sparse matrices in the coordinate scheme format (COOS) as well as in the compressed sparse row format (CSR). Now that we have implemented all these algorithms, we want to know how much more performant they are compared to the usual matrix multiplication. We have already implemented code to measure the running time of all these algorithms, but now we have decided that we also want to know how many floating point operations per second (FLOPS) we can perform.
Any suggestions on how to measure/count this?
Here is some background information on the system we used:
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls vfpd32
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x3
CPU part : 0xc08
CPU revision : 2
Our first idea was to implement a kind of FPO counter which we increment after each floating point operation (arithmetic operations as well as comparison and move operations), but that would mean we have to insert increment operations all over our code, which would also slow down the application ...
Does anyone know if there is some kind of hardware counter which counts the number of floating point operations, or whether there exists some kind of performance tool which can be used to monitor our program and measure the number of FPOs?
Any suggestions or pointers would be appreciated.
Here is the evaluation of the FLOPS for a matrix multiplication using the counting approach. We first measured the running time, then inserted counters for each instruction we were interested in, and after that calculated the number of floating point operations per second.

It looks like the closest you can get with the performance events supported by Cortex-A8 is a count of total instructions executed, which isn't very helpful given that "an instruction" performs anything from 0 to (I think) 8 FP operations. Taking a step back, it becomes apparent that trying to measure FLOPS for the algorithm in hardware wouldn't really work anyway - e.g. you could write an implementation using vector ops but not always put real data in all lanes of each vector, in which case the CPU would need to be psychic to know how many of the FP operations it's executing actually count.
Fortunately, given a formal definition of an algorithm, calculating the number of operations involved should be fairly straightforward (although not necessarily easy, depending on the complexity). For instance, running through it in my head, the standard naïve multiplication of an m x n matrix with an n x m matrix comes out to m * m * (n + n - 1) operations (n multiplications and (n - 1) additions per output element). Once on-paper analysis has come up with an appropriately parameterised op-counting formula, you can plumb that into your benchmarking tool to calculate numbers for the data on test.
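For example, here is a minimal C sketch of plumbing such a formula in for the naive dense case (it assumes you already have the elapsed time from your existing measurement code; the COOS/CSR variants would get their own formulas based on the number of non-zero elements):

#include <stdio.h>

/* Sketch: plug an on-paper operation count into a time measurement.
   For the naive multiplication of an m x n matrix with an n x m matrix,
   each of the m * m output elements costs n multiplications and n - 1
   additions, i.e. m * m * (2n - 1) floating point operations in total. */
static double naive_matmul_mflops(long m, long n, double elapsed_seconds)
{
    double flop_count = (double)m * (double)m * (2.0 * (double)n - 1.0);
    return flop_count / elapsed_seconds / 1e6;   /* MFLOPS */
}

int main(void)
{
    /* purely illustrative numbers */
    printf("%.2f MFLOPS\n", naive_matmul_mflops(512, 512, 0.75));
    return 0;
}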
Once you've done all that, you'll probably then start regretting spending all the time to do it, because what you'll have is (arbitrary number) / (execution time) which is little more meaningful than (execution time) alone, and mostly just complicates comparison between cases where (arbitrary number) differs. NEON performance in particular is dominated by pipeline latency and memory bandwidth, and as such the low-level implementation details could easily outweigh any inherent difference the algorithms might have.
Think of it this way: say that on some given 100 MHz CPU, a + a + b + b takes 5 cycles total, while (a + b) * 2 takes 4 cycles total* - the former scores 60 MFLOPS (3 FP ops in 50 ns), the latter only 50 MFLOPS (2 FP ops in 40 ns). Are you going to say that more FLOPS means better performance, in which case the routine which takes 25% longer to give the same result is somehow "better"? Are you going to say fewer FLOPS means better performance, which is clearly untrue for any reasonable interpretation? Or are you going to conclude that FLOPS is pretty much meaningless for anything other than synthetic benchmarks to compare the theoretical maximum bandwidth of one CPU with another?
* numbers pulled out of thin air for the sake of argument; however they're actually not far off something like Cortex-M4F - a single-precision FPU where both add and multiply are single-cycle, plus one or two for register hazards.

Number of cores x average frequency x operations per cycle
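In other words, theoretical peak FLOPS is just that product. A trivial sketch (the operations-per-cycle figure is architecture-specific and has to come from the core's documentation):

#include <stdio.h>

/* Sketch of the peak-FLOPS product above. The flops-per-cycle figure is
   architecture-specific and must come from the CPU's documentation. */
static double peak_gflops(int cores, double ghz, double flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;
}

int main(void)
{
    /* e.g. the Core 2 Duo E7400 discussed further down:
       2 cores * 2.8 GHz * 4 single-precision FLOPs per cycle = 22.4 GFLOPS */
    printf("%.1f GFLOPS\n", peak_gflops(2, 2.8, 4.0));
    return 0;
}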

Related

What do single-cycle multiplication and hardware division mean?

I am going through a data sheet and read "Single-cycle multiplication and hardware division" as part of the STM32 specifications, and I am not sure I understand what that means. From what I read on the net, multiplication is usually easier to compute than division. Would that mean that STM32s can compute both multiplication and division within one cycle?
Please assist.
When it comes to the multiplier, it means that it takes only one clock cycle (that is, at 100 MHz, 10 nanoseconds) to perform the operation.
The division, however, is usually performed in an iterative fashion, bit by bit, and the particular implementation (the core instruction set) should be looked into.
Having a look at the Cortex-M series, you see that the multiplication is in fact single-cycle, while the division takes 2-12 cycles, with the following footnote regarding this:
Division operations use early termination to minimize the number of cycles required based on the number of leading ones and zeroes in the input operands.
Added:
Notice, however, that only INTxINT multiplications are single-cycle, whereas LONGxLONG multiplications take 3-5 cycles (as a LONGxLONG multiplication can be performed as a combination of INTxINT multiplications and additions); a sketch of that decomposition follows below.
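This is not the actual hardware algorithm, just a C illustration of the textbook decomposition of a 64x64-bit product into 32x32-bit multiplies and additions, keeping only the low 64 bits:

#include <stdint.h>

/* Low 64 bits of a 64x64-bit product built from 32x32->64-bit multiplies
   and additions - roughly the extra work a core with only a 32-bit
   multiplier has to do for a LONGxLONG multiplication. */
static uint64_t mul64_from_32(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;                          /* bits 0..63 */
    uint64_t cross = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo;  /* shifted up by 32 */

    /* a_hi * b_hi only affects the discarded high half of the product */
    return lo_lo + (cross << 32);
}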

Calculating Floating Point Operations Per Second (FLOPS) and Integer Operations Per Second (IOPS)

I am trying to learn some basic benchmarking. I have a loop in my Java program like this:

float a = 6.5f;
int b = 3;
float var = 0.0f;
for (long j = 0; j < 999999999; j++) {
    var = a * b + (a / b);
} // end of for
My processor takes around 0.431635 seconds to process this. How would I calculate the processor speed in terms of FLOPS (floating point operations per second) and IOPS (integer operations per second)? Can you provide an explanation with some steps?
You have a single loop with 999999999 iterations: let's call this 1e9 (one billion) for simplicity. The integers will get promoted to floats in the calculations that involve both, so the loop contains 3 floating-point operations per iteration: one mul, one add, and one div, i.e. 3e9 in total. This takes 0.432 s, so you're apparently getting about 6.94 GFLOP/s (3e9 / 0.432). Similarly, you are doing 1 integer op (j++) per loop iteration, so you are getting 1e9 / 0.432, or about 2.32 GIOP/s.
However, the calculation a*b+(a/b) is loop-invariant, so it would be pretty surprising if this didn't get optimized away. I don't know much about Java, but any C compiler will evaluate this at compile-time, remove the a and b variables and the loop, and (effectively) replace the whole lot with var=21.667;. This is a very basic optimization, so I'd be surprised if javac didn't do it too.
I have no idea what's going on under the hood in Java, but I'd be suspicious of getting 7 GFLOPs. Modern Intel CPUs (I'm assuming that's what you've got) are, in principle, capable of two vector arithmetic ops per clock cycle with the right instruction mix (one add and one mult per cycle), so for a 3 GHz 4-core CPU, it's even possible to get 3e9*4*8 = 96 single-precision GFLOPs under ideal conditions. The various mul and add instructions have a reciprocal throughput of 1 cycle, but the div takes more than ten times as long, so I'd be very suspicious of getting more than about CLK/12 FLOPs (scalar division on a single core) once division is involved: if the compiler is smart enough to vectorize and/or parallelize the code to get more than that, which it would have to do, it would surely be smart enough to optimize away the whole loop.
In summary, I suspect that the loop is being optimized away completely and the 0.432 seconds you're seeing is just overhead. You have not given any indication how you're timing the above loop, so I can't be sure. You can check this out for yourself by replacing the ~1e9 loop iterations with 1e10. If it doesn't take about 10x as long, you're not timing what you think you're timing.
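For what it's worth, here is a minimal C sketch of the same benchmark written so that the loop cannot be folded away (the original is Java, but the idea of using volatile, or otherwise consuming the result, carries over):

#include <stdio.h>
#include <time.h>

int main(void)
{
    /* volatile forces the loads and the store to happen every iteration,
       so the 3 FP ops per iteration are really executed. */
    volatile float a = 6.5f, var = 0.0f;
    volatile int b = 3;
    const long iters = 999999999L;

    clock_t start = clock();
    for (long j = 0; j < iters; j++)
        var = a * b + (a / b);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* one mul, one add, one div per iteration */
    printf("%.3f s, %.2f GFLOP/s\n", seconds, 3.0 * iters / seconds / 1e9);
    return 0;
}

The sanity check suggested above still applies: if multiplying iters by 10 doesn't multiply the time by roughly 10, the measurement is still broken.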
There's a lot more to say about benchmarking and profiling, but I'll leave it at that.
I know this is very late, but I hope it helps someone.
Emmet.

(n - Multiplication) vs (n/2 - multiplication + 2 additions) which is better?

I have a C program that has n multiplications (a single multiplication with n iterations), and I found another approach that has n/2 iterations of (1 multiplication + 2 additions). I know both are O(n) in complexity, but in terms of CPU cycles, which is faster?
Test on your computer. Or, look at the specs for your processor and guess.
The old logic no longer applies: on modern processors, an integer multiplication might be very cheap, on some newish Intel processors it's 3 clock cycles. Additions are 1 cycle on these same processors. However, in a modern pipelined processor, the stalls created by data dependencies might cause additions to take longer.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
First of all follow Dietrich Epp's first advice - measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two different important performance measures: Latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NAN’s and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
For Sandy Bridge, the rec. throughput for an add r, r/i (where r = register, i = immediate, m = memory) is 0.33, while the latency is 1.
An imul r, r has a latency of 3 and a rec. throughput of 1.
So as you see it completely depends on your specific algorithm - if you can just replace one imul with two independent adds this particular part of your algorithm could get a theoretical speedup of 50% (and in the best case obviously a speedup of ~350%). But on the other hand if your adds add a problematic dependency one imul could be just as fast as one add.
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyways if you want a good listing of this stuff see this here (the above description of latency/rec. throughput is also from that particular document).
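If you do go down the measuring route, a skeleton along these lines is usually enough. Note that the two kernels below are only placeholders mimicking the stated instruction mix (the question doesn't show the real code), so treat this as a sketch of the harness, not of your algorithm:

#include <stdio.h>
#include <time.h>

/* Placeholder kernels: n multiplications vs. n/2 iterations of one
   multiplication plus two additions. Unsigned arithmetic so wrap-around
   is well defined; the dependency chain through acc is intentional.
   The large odd constant discourages strength reduction to shifts/adds. */
static unsigned long kernel_n_mults(long n)
{
    unsigned long acc = 1;
    for (long i = 0; i < n; i++)
        acc = acc * 2654435761u;
    return acc;
}

static unsigned long kernel_half_mult_two_adds(long n)
{
    unsigned long acc = 1;
    for (long i = 0; i < n / 2; i++)
        acc = acc * 2654435761u + (unsigned long)i + 7u;   /* 1 mul + 2 adds */
    return acc;
}

int main(void)
{
    volatile unsigned long sink;   /* keeps the results live */
    const long n = 200000000L;

    clock_t t0 = clock();
    sink = kernel_n_mults(n);
    double t_mul = (double)(clock() - t0) / CLOCKS_PER_SEC;

    t0 = clock();
    sink = kernel_half_mult_two_adds(n);
    double t_mix = (double)(clock() - t0) / CLOCKS_PER_SEC;

    (void)sink;
    printf("n mults:           %.3f s\n", t_mul);
    printf("n/2 mult + 2 adds: %.3f s\n", t_mix);
    return 0;
}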

Which is faster — sorting or multiplying a small array of elements?

Reading through Cactus Kev's Poker Hand Evaluator, I noticed the following statements:
At first, I thought that I could always simply sort the hand first before passing it to the evaluator; but sorting takes time, and I didn't want to waste any CPU cycles sorting hands. I needed a method that didn't care what order the five cards were given as.
...
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
...
Since multiplication is one of the fastest calculations a computer can make, we have shaved hundreds of milliseconds off our time had we been forced to sort each hand before evaluation.
I have a hard time believing this.
Cactus Kev represents each card as a 4-byte integer, and evaluates hands by calling eval_5cards( int c1, int c2, int c3, int c4, int c5 ). We could represent cards as one byte, and a poker hand as a 5-byte array. Sorting this 5-byte array to get a unique hand must be pretty fast. Is it faster than his approach?
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them? If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
Thanks!
Good answers everyone; I'm working on benchmarking the performance of sorting vs multiplication, to get some hard performance statistics.
Of course it depends a lot on the CPU of your computer, but a typical Intel CPU (e.g. Core 2 Duo) can multiply two 32-bit numbers within 3 CPU clock cycles. For a sort algorithm to beat that, it needs to be faster than the 4 multiplications * 3 cycles = 12 CPU cycles it takes to multiply a 5-card hand, which is a very tight constraint. None of the standard sorting algorithms can do it in less than 12 cycles for sure. The comparison of two numbers alone will take one CPU cycle, the conditional branch on the result will take another, and whatever you do then will take at least one more (swapping two cards will actually take at least 4 CPU cycles). So multiplying wins.
Of course this does not take into account the latency of fetching the card values from first- or second-level cache or maybe even memory; however, that latency applies to both cases, multiplying and sorting.
Without testing, I'm sympathetic to his argument. You can do it in 4 multiplications, as compared to sorting, which is n log n. Specifically, the optimal sorting network requires 9 comparisons. The evaluator then has to at least look at every element of the sorted array, which is another 5 operations.
Sorting is not intrinsically harder than multiplying numbers. On paper, they're about the same, and you also need a sophisticated multiplication algorithm to make large multiplication competitive with large sort. Moreover, when the proposed multiplication algorithm is feasible, you can also use bucket sort, which is asymptotically faster.
However, a poker hand is not an asymptotic problem. It's just 5 cards and he only cares about one of the 13 number values of the card. Even if multiplication is complicated in principle, in practice it is implemented in microcode and it's incredibly fast. What he's doing works.
Now, if you're interested in the theoretical question, there is also a solution using addition rather than multiplication. There can only be 4 cards of any one value, so you could just as well assign the values 1,5,25,...,5^12 and add them. It still fits in 32-bit arithmetic. There are also other addition-based solutions with other mathematical properties. But it really doesn't matter, because microcoded arithmetic is so much faster than anything else that the computer is doing.
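For illustration, here is roughly what the two order-independent signatures look like in C (ranks 0-12; the prime table is simply the first 13 primes, in the spirit of the article rather than copied from it):

#include <stdint.h>
#include <stdio.h>

/* The first 13 primes, one per card rank (deuce..ace). */
static const uint32_t rank_prime[13] = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41 };

/* Unique product of the five rank primes; fits in 32 bits
   (worst case 41*41*41*41*37 is about 1.05e8). */
static uint32_t prime_key(const int rank[5])
{
    uint32_t key = 1;
    for (int i = 0; i < 5; i++)
        key *= rank_prime[rank[i]];
    return key;
}

/* Addition-based alternative: at most 4 cards per rank, so summing 5^rank
   is also collision-free and stays inside 32-bit arithmetic. */
static uint32_t additive_key(const int rank[5])
{
    uint32_t key = 0;
    for (int i = 0; i < 5; i++) {
        uint32_t term = 1;
        for (int r = 0; r < rank[i]; r++)
            term *= 5;
        key += term;
    }
    return key;
}

int main(void)
{
    int hand[5] = { 12, 11, 10, 9, 8 };   /* A K Q J T, in any order */
    printf("prime key %u, additive key %u\n", prime_key(hand), additive_key(hand));
    return 0;
}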
5 elements can be sorted using an optimized decision tree, which is much faster than using a general-purpose sorting algorithm.
However, the fact remains that sorting means lots of branches (as do the comparisons that are necessary afterwards). Branches are really bad for modern pipelined CPU architectures, especially branches that go either way with similar likelihood (thus defeating branch prediction logic). That, much more than the theoretical cost of multiplication vs. comparisons, makes multiplication faster.
But if you could build custom hardware to do the sorting, it might end up faster.
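To make the "optimized decision tree" idea concrete, here is a sketch of the 9-comparator sorting network for 5 values in C; each compare-exchange is written so that compilers can often lower it to conditional moves rather than unpredictable branches:

#include <stdio.h>

/* Branch-light compare-exchange: the two ternaries can usually be
   compiled to conditional moves instead of branches. */
static void cswap(int *a, int *b)
{
    int lo = *a < *b ? *a : *b;
    int hi = *a < *b ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Optimal 9-comparator sorting network for 5 elements. */
static void sort5(int v[5])
{
    cswap(&v[0], &v[1]);
    cswap(&v[3], &v[4]);
    cswap(&v[2], &v[4]);
    cswap(&v[2], &v[3]);
    cswap(&v[1], &v[4]);
    cswap(&v[0], &v[3]);
    cswap(&v[0], &v[2]);
    cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}

int main(void)
{
    int hand[5] = { 11, 2, 7, 2, 9 };
    sort5(hand);
    printf("%d %d %d %d %d\n", hand[0], hand[1], hand[2], hand[3], hand[4]);
    return 0;
}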
That shouldn't really be relevant, but he is correct. Sorting takes much longer than multiplying.
The real question is what he did with the resulting prime number, and how that was helpful (since I would expect factoring it to take longer than sorting).
It's hard to think of any sorting operation that could be faster than multiplying the same set of numbers. At the processor level, the multiplication is just load, load, multiply, load, multiply, ..., with maybe some manipulation of the accumulator thrown in. It's linear, easily pipelined, no comparisons with the associated branch mis-prediction costs. It should average about 2 instructions per value to be multiplied. Unless the multiply instruction is painfully slow, it's really hard to imagine a faster sort.
One thing worth mentioning is that even if your CPU's multiply instruction is dead slow (or nonexistent...) you can use a lookup table to speed things even further.
After a lot of thought, I had a brainstorm to use prime numbers. I would assign a prime number value to each of the thirteen card ranks... The beauty of this system is that if you multiply the prime values of the rank of each card in your hand, you get a unique product, regardless of the order of the five cards.
That's an example of a non-positional number system.
I can't find the link to the theory. I studied that as part of applied algebra, somewhere around Euler's totient and encryption. (I may be wrong with the terminology, as I studied all of that in my native language.)
What if we keep his representation (cards as 4-byte integers)? Can sorting an array of 5 integers be faster than multiplying them?
RAM is an external resource and is generally slower compared to the CPU. Sorting 5 ints would always have to go to RAM due to the swap operations. Add to that the overhead of the sorting function itself, and multiplication stops looking all that bad.
I think that on modern CPUs integer multiplication would pretty much always be faster than sorting, since several multiplications can be executed at the same time on different ALUs, while there is only one bus connecting the CPU to RAM.
If not, what sort of low-level optimizations can be done to make sorting a small number of elements faster?
5 integers can be sorted quite quickly using bubble sort: qsort would use more memory (for recursion), while a well-optimized bubble sort would work completely from the d-cache.
As others have pointed out, sorting alone isn't quicker than multiplying for 5 values. This ignores, however, the rest of his solution. After disdaining a 5-element sort, he proceeds to do a binary search over an array of 4888 values - at least 12 comparisons, more than the sort ever required!
Note that I'm not saying there's a better solution that involves sorting - I haven't given it enough thought, personally - just that sorting alone is only part of the problem.
He also didn't have to use primes. If he simply encoded the value of each card in 4 bits, he'd need 20 bits to represent a hand, giving 2^20 = 1,048,576 possible values, about 1/100th of the range produced using primes, and small enough (though still suffering cache coherency issues) to build a lookup table over.
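A sketch of that packing, with the caveat (glossed over above) that the key is only order-independent if the ranks are first put into some canonical order, e.g. with the 5-element sort discussed earlier:

#include <stdint.h>

/* Five ranks (0..12), each in 4 bits, packed into a 20-bit key, so the
   lookup table needs at most 2^20 slots. The ranks are assumed to already
   be in a canonical (e.g. sorted) order so that the key ignores card order. */
static uint32_t packed_key(const int sorted_rank[5])
{
    uint32_t key = 0;
    for (int i = 0; i < 5; i++)
        key = (key << 4) | (uint32_t)sorted_rank[i];
    return key;
}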
Of course, an even more interesting variant is to take 7 cards, such as are found in games like Texas Holdem, and find the best 5 card hand that can be made from them.
The multiplication is faster.
Multiplying any given array will always be faster than sorting it, presuming the multiplication produces a meaningful result. The lookup table is irrelevant because the code is designed to evaluate a poker hand, so you'd need to do a lookup on the sorted set anyway.
An example of a ready made Texas Hold'em 7- and 5-card evaluator can be found here with documentation and further explained here. All feedback welcome at the e-mail address found therein.
You don't need to sort, and can typically (~97% of the time) get away with just 6 additions and a couple of bit shifts when evaluating 7-card hands. The algo uses a generated look up table which occupies about 9MB of RAM and is generated in a near-instant. Cheap. All of this is done inside of 32-bits, and "inlining" the 7-card evaluator is good for evaluating about 50m randomly generated hands per second on my laptop.
Oh, and multiplication is faster than sorting.

FLOPS Intel core and testing it with C (inner product)

I have some misconceptions about measuring FLOPS. On the Intel architecture, is a FLOP one addition and one multiplication together? I read this somewhere online and couldn't find anything that refutes it. I know that a FLOP can mean different things on different types of CPU.
How do I calculate my theoretical peak FLOPS? I am using an Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz. What exactly is the relationship between GHz and FLOPS? (Even Wikipedia's entry on FLOPS does not specify how to do this.)
I will be using the following method to measure the actual performance of my computer (in terms of FLOPS): the inner product of two vectors. For two vectors of size N, is the number of flops 2n(n - 1) (if one addition or one multiplication is considered to be 1 flop)? If not, how should I go about calculating this?
I know there are better ways to do this, but I would like to know whether my proposed calculations are right. I read somewhere about LINPACK as a benchmark, but I would still like to know how it's done.
As for your 2nd question, the theoretical FLOPS calculation isn't too hard. It can be broken down into roughly:
(Number of cores) * (Number of execution units / core) * (cycles / second) * (Execution unit operations / cycle) * (floats-per-register / Execution unit operation)
A Core 2 Duo has 2 cores and 1 execution unit per core. An SSE register is 128 bits wide, and a float is 32 bits wide, so you can store 4 floats per register. I assume the execution unit does 1 SSE operation per cycle. So it should be:
2 * 1 * 2.8 * 1 * 4 = 22.4 GFLOPS
which matches:
http://www.intel.com/support/processors/sb/cs-023143.htm
This number is obviously purely theoretical best-case performance. Real-world performance will most likely not come close to it for a variety of reasons. It's probably not worth trying to directly correlate FLOPS to actual app runtime; you'd be better off trying out the computations used by your application.
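As for the inner-product part of the question: a dot product of two length-N vectors costs N multiplications and N - 1 additions, i.e. 2N - 1 FLOPs, not 2n(n - 1). A minimal, unoptimized C sketch of measuring it (plain clock() timing, no SSE intrinsics, so don't expect it to come anywhere near the 22.4 GFLOPS peak):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000L

int main(void)
{
    float *x = malloc(N * sizeof *x);
    float *y = malloc(N * sizeof *y);
    if (!x || !y)
        return 1;
    for (long i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    clock_t t0 = clock();
    float dot = 0.0f;
    for (long i = 0; i < N; i++)
        dot += x[i] * y[i];              /* one mul and one add per element */
    double seconds = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* N multiplications and N - 1 additions = 2N - 1 FLOPs */
    double gflops = (2.0 * N - 1.0) / seconds / 1e9;
    printf("dot = %g, %.3f s, %.2f GFLOP/s\n", dot, seconds, gflops);

    free(x);
    free(y);
    return 0;
}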
This article shows some theory on FLOPS numbers for x86 CPUs. It's only current up to Pentium 4, but perhaps you can extrapolate.
A FLOP stands for Floating Point Operation.
It means the same in any architecture that supports floating point operations, and is usually measured as the number of operations that can take place in one second (as in FLOPS: floating point operations per second).
Here you can find tools to measure your computer's FLOPS.
Intel's data sheets contain GFLOPS numbers, and your processor has a claimed 22.4 GFLOPS:
http://www.intel.com/support/processors/sb/CS-023143.htm
Since your machine is dual-core, that means 11.2 GFLOPS per core at 2.8 GHz. Divide that out (11.2 / 2.8) and you get 4, so Intel claims that each of their cores can do 4 FLOPs per cycle.

Resources