Ops per cycle for ARM Cortex CPUs?

I need the number of operations per cycle that an ARM processor can execute, in particular for the Cortex-A7, Cortex-A9 and Cortex-A15.
I can't find anything online!
Thank you
EDIT: I need it for calculating the theoretical peak performance.

I have not looked into integers yet, but for single- and double-precision floating point operations per cycle this is what I have come up with so far (from flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2, peak-flops-per-cycle-for-arm11-and-cortex-a7-cores-in-raspberry-pi-1-and-2, and the Cortex-A9 NEON Media Processing Engine Technical Reference Manual).
Cortex-A7:
0.5 DP FLOPs/cycle: scalar VMLA.F64 every four cycles.
1.0 DP FLOPs/cycle: scalar VADD.F64 every cycle.
2.0 SP FLOPs/cycle: scalar VMLA.F32 every cycle.
2.0 SP FLOPs/cycle: 2-wide VMLA.F32 every other cycle.
Cortex-A9:
1.5 DP FLOPs/cycle: scalar VMLA.F64 + scalar VADD.F64 every other cycle.
4.0 SP FLOPs/cycle: 2-wide VMLA.F32 every cycle.
Cortex-A15:
2.0 DP FLOPs/cycle: scalar VMLA.F64 (or VFMA.F64) every cycle.
8.0 SP FLOPs/cycle: 4-wide VMLA.F32 (or VFMA.F32) every cycle.
One interesting observation is that NEON floating point is no faster than VFP on the Cortex-A7.
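To get from these per-cycle numbers to the theoretical peak the question asks about, multiply by clock rate and core count. A minimal sketch in C; the clock rates and core counts below are example values, not properties of any specific SoC:

    #include <stdio.h>

    /* Theoretical peak = cores * clock (GHz) * FLOPs per cycle.
     * The FLOPs/cycle values are the single-precision numbers listed above;
     * the clock rates and core counts are placeholders for your actual SoC. */
    static double peak_gflops(int cores, double ghz, double flops_per_cycle) {
        return cores * ghz * flops_per_cycle;
    }

    int main(void) {
        printf("Cortex-A7  (1 core  @ 1.0 GHz): %.1f SP GFLOPS\n", peak_gflops(1, 1.0, 2.0));
        printf("Cortex-A9  (2 cores @ 1.0 GHz): %.1f SP GFLOPS\n", peak_gflops(2, 1.0, 4.0));
        printf("Cortex-A15 (4 cores @ 2.0 GHz): %.1f SP GFLOPS\n", peak_gflops(4, 2.0, 8.0));
        return 0;
    }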

Just look in the most common places.
You have it right here: http://en.wikipedia.org/wiki/List_of_ARM_microarchitectures

Related

What do single-cycle multiplication and hardware division mean?

I am going through a datasheet and read "Single-cycle multiplication and hardware division" as part of the STM32 specifications, and I am not sure I understand what that means. From what I have read online, multiplication is usually easier to compute than division. Does that mean that STM32s can compute both multiplication and division within one cycle?
Please assist.
For the multiplier, it means that it takes only one clock cycle (that is, 10 nanoseconds at 100 MHz) to perform the operation.
Division, however, is usually performed iteratively, bit by bit, so the particular implementation (the core's instruction timings) has to be looked up.
Looking at the Cortex-M series instruction timings, you see that multiplication is in fact single-cycle, whereas division takes 2-12 cycles, with this footnote about it:
Division operations use early termination to minimize the number of cycles required based on the number of leading ones and zeroes in the input operands.
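For intuition only, here is a rough software analogue of that early termination: a shift-subtract divider whose iteration count depends on the number of significant bits in the dividend. This is not how the hardware divider is built, just an illustration of why the cycle count varies with the operands.

    #include <stdint.h>

    /* Restoring division, bit by bit. The loop runs once per significant
     * bit of the dividend, so small operands finish in fewer iterations --
     * the same idea as the hardware's early termination. den must be != 0. */
    uint32_t div_u32(uint32_t num, uint32_t den, uint32_t *rem) {
        uint32_t quot = 0, r = 0;
        int bits = 32 - __builtin_clz(num | 1);     /* skip leading zeros */
        for (int i = bits - 1; i >= 0; i--) {
            r = (r << 1) | ((num >> i) & 1);        /* bring down next bit */
            quot <<= 1;
            if (r >= den) { r -= den; quot |= 1; }
        }
        if (rem) *rem = r;
        return quot;
    }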
Added:
Notice, however, that only INTxINT multiplications are single-cycle, whereas LONGxLONG multiplications take 3-5 cycles (a LONGxLONG multiply can be performed as a combination of INTxINT multiplications and additions).
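To illustrate that last point, here is how the low 64 bits of a 64x64 product can be assembled from 32x32 multiplications and additions, which is roughly what a compiler does for a 64-bit multiply on a 32-bit core:

    #include <stdint.h>

    /* Low 64 bits of a 64x64 product built from 32x32 multiplies and adds.
     * The a_hi * b_hi term only affects bits above 63, so it is dropped. */
    uint64_t mul64_lo(uint64_t a, uint64_t b) {
        uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
        uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
        uint64_t lo  = (uint64_t)a_lo * b_lo;                         /* 32x32 -> 64 */
        uint64_t mid = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo; /* cross terms */
        return lo + (mid << 32);
    }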

Would a C6000 DSP be outperformed by a Cortex A9 for FP

I'm using an OMAP-L138 processor at the moment, which does not have a hardware FPU. We will be processing spectral data using algorithms that are FP intensive, so the ARM side won't be adequate. I'm not the algorithm person, but one of them is "Dynamic Time Warping" (I don't know the details of it). The initial performance numbers are:
Core i7 laptop @ 2.9 GHz: 1 second
Raspberry Pi ARM1176 @ 700 MHz: 12 seconds
OMAP-L138 ARM926 @ 300 MHz: 193 seconds
Worse, the Pi is about 30% of the price of the board I'm using!
I do have the TI C674x, which is the other core in the OMAP-L138. The question is whether I would be better served by spending many weeks:
learning DSPLINK, the interop libraries and the toolchain, not to mention forking out the large cost of Code Composer, or
throwing the L138 out and moving to a dual Cortex-A9 board like the Pandaboard, possibly suffering power penalties in the process.
(When I look at FPU performance, the Cortex-A8 isn't an improvement over the Raspberry Pi, but the Cortex-A9 seems to be.)
I understand the answer is "it depends". Others here have said that "you unlock an incredible fast DSP that can easily outperform the Cortex-A8 if assigned the right job" but for a defined job set would I be better off skipping to the A9, even if I had to buy an external DSP later?
That question can't be answered without knowing the clock rates of the DSP and the ARM.
Here is some background:
I just checked the cycles of a floating point multiplication on the c674x DSP:
It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result appears in the destination register).
You can however start two multiplications each cycle because the DSP will not wait for the result. The compiler/assembler will do the required scheduling for you.
That uses only two of the DSP's eight functional units, so while doing the two multiplications you can, in the same cycle, also do:
two load/stores (64 bit wide)
four floating point add/subtract instructions (or integer instructions)
Loop control and branching are free and do not cost you anything on the DSP.
That makes a total of six floating point operations per cycle, with parallel loads/stores and loop control.
ARM NEON, on the other hand, can in floating point mode:
Issue two multiplications per cycle. Latency is comparable, and the instructions are likewise pipelineable, as on the DSP. Loading/storing takes extra issue slots, as do add/subtract instructions. Loop control and branching will very likely be free in well-written code.
So in summary the DSP does three times as much work per cycle as the Cortex-A9 NEON unit.
Now you can check the clock rates of the DSP and the ARM and see which is faster for your job.
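As a hedged, back-of-the-envelope example using the per-cycle counts above (6 for the DSP, 2 for NEON) and typical clock rates for these parts; the frequencies below are assumptions, so substitute the real numbers from your boards' data sheets:

    #include <stdio.h>

    int main(void) {
        /* Example clock rates only -- check the data sheets for your parts. */
        double dsp_mhz = 456.0;    /* C674x in an OMAP-L138, upper speed grade */
        double arm_mhz = 1000.0;   /* Cortex-A9 in a Pandaboard-class SoC      */

        /* Peak SP floating point operations per cycle, per the discussion above. */
        double dsp_flops_per_cycle  = 6.0;
        double neon_flops_per_cycle = 2.0;

        printf("C674x     peak: %.2f SP GFLOPS\n", dsp_mhz * dsp_flops_per_cycle / 1000.0);
        printf("Cortex-A9 peak: %.2f SP GFLOPS\n", arm_mhz * neon_flops_per_cycle / 1000.0);
        return 0;
    }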
Oh, one thing: With well-written DSP code you will almost never see a cache miss during loads because you move the data from RAM to the cache using DMA before you access the data. This gives impressive speed advantages for the DSP as well.
It does depend on the application but, generally speaking, it is rare these days for special-purpose processors to beat general-purpose processors. General-purpose processors now have higher clock rates and multimedia acceleration. Even for a numerically intensive algorithm where a DSP may have an edge, the increased engineering complexity of dealing with a heterogeneous multi-processor environment makes this type of solution problematic from an ROI perspective.

(n - Multiplication) vs (n/2 - multiplication + 2 additions) which is better?

I have a C program that performs n multiplications (a single multiplication per iteration, over n iterations), and I found another approach that uses n/2 iterations of (1 multiplication + 2 additions). I know both are O(n) in complexity, but in terms of CPU cycles, which is faster?
Test on your computer. Or, look at the specs for your processor and guess.
The old rule of thumb no longer applies: on modern processors an integer multiplication can be very cheap; on some newish Intel processors it is 3 clock cycles, while additions take 1 cycle. However, in a modern pipelined processor, stalls created by data dependencies can make additions take longer than that.
My guess is that N additions + N/2 multiplications is slower than N multiplications if you are doing a fold type operation, and I would guess the reverse for a map type operation. But this is only a guess.
Test if you want the truth.
However: Most algorithms this simple are memory-bound, and both will be the same speed.
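To make this concrete, here is a hedged sketch of what the two shapes could look like for a simple fold (summing scaled elements), plus a crude timing harness. These are stand-ins for the asker's actual loops, which aren't shown in the question:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N 10000000   /* must be even for variant B below */

    /* Variant A: one multiplication per element (n multiplications). */
    static long long variant_a(const int *x) {
        long long acc = 0;
        for (long i = 0; i < N; i++)
            acc += (long long)x[i] * 7;
        return acc;
    }

    /* Variant B: n/2 iterations of (2 additions + 1 multiplication);
     * by distributivity it computes the same sum as variant A. */
    static long long variant_b(const int *x) {
        long long acc = 0;
        for (long i = 0; i < N; i += 2)
            acc += (long long)(x[i] + x[i + 1]) * 7;
        return acc;
    }

    int main(void) {
        int *x = malloc(N * sizeof *x);
        for (long i = 0; i < N; i++) x[i] = (int)(i & 0xff);

        clock_t t0 = clock();
        long long a = variant_a(x);
        clock_t t1 = clock();
        long long b = variant_b(x);
        clock_t t2 = clock();

        printf("A: %lld in %.3f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("B: %lld in %.3f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
        free(x);
        return 0;
    }

Check the generated assembly if you want the scalar comparison (the compiler may vectorize both), and note that, as said above, memory traffic will likely dominate either way.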
First of all, follow Dietrich Epp's first piece of advice: measuring is (at least for complex optimization problems) the only way to be sure.
Now if you want to figure out why one is faster than the other, we can try. There are two important performance measures: latency and reciprocal throughput. A short summary of the two:
Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Where hyperthreading is enabled, the use of the same execution units in the other thread leads to inferior performance. Denormal numbers, NaNs and infinity do not increase the latency. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.
Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
For Sandy Bridge, the reciprocal throughput of an add r, r/i (where r = register, i = immediate, m = memory) is 0.33, while the latency is 1.
An imul r, r has a latency of 3 and a reciprocal throughput of 1.
So as you see, it completely depends on your specific algorithm: if you can replace one imul with two independent adds, that particular part of your algorithm could get a theoretical speedup of 50% (and in the best case a speedup of ~350%). On the other hand, if your adds introduce a problematic dependency, one imul could be just as fast as one add.
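To see what "independent" buys you, compare one serial chain of adds against two interleaved accumulators; the first is limited by the 1-cycle add latency, while the second can approach the 0.33-cycle reciprocal throughput (a sketch that ignores vectorization and memory effects):

    #include <stdint.h>
    #include <stddef.h>

    /* One dependency chain: every add waits on the previous one,
     * so the loop is bounded by add latency. */
    int64_t sum_serial(const int32_t *x, size_t n) {
        int64_t acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += x[i];
        return acc;
    }

    /* Two independent chains: adjacent adds don't depend on each other,
     * so the core can overlap them and approach the throughput limit. */
    int64_t sum_split(const int32_t *x, size_t n) {
        int64_t acc0 = 0, acc1 = 0;
        size_t i = 0;
        for (; i + 1 < n; i += 2) {
            acc0 += x[i];
            acc1 += x[i + 1];
        }
        if (i < n) acc0 += x[i];   /* odd tail element */
        return acc0 + acc1;
    }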
Also note that we've ignored all the additional complications like memory and cache behavior (things which will generally have a much, MUCH larger influence on the execution time) or intricate stuff like µop fusion and whatnot. In general the only people that should care about this stuff are compiler writers - it's much simpler to just measure the result of their efforts ;)
Anyway, if you want a good listing of this stuff, see Agner Fog's instruction tables (the above description of latency/reciprocal throughput is also from that document).

Do I get a performance penalty when mixing SSE integer/float SIMD instructions

I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but in theory should perform equally for both. For example, both float and double vectors have instructions to load higher 64bits of a 128-bit vector from an address (movhps, movhpd), but there's no such instruction for integer vectors.
My question:
Is there any reason to expect a performance hit when using floating point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
I wrote several tests to check this, but I suppose their results are not credible. It's really hard to write a correct test that explores all the corner cases for such things, especially when instruction scheduling is most probably involved.
Related question:
Other trivially similar operations also have several instructions that do basically the same thing. For example, I can do a bitwise OR with por, orps or orpd. Can anyone explain the purpose of these additional instructions? I guess this might be related to different scheduling algorithms applied to each instruction.
From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:
There is a penalty for using the wrong type of instructions on some processors. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units can take one or more clock cycles, depending on the processor, as listed in table 13.2.
Processor                      Bypass delay, clock cycles
Intel Core 2 and earlier       1
Intel Nehalem                  2
Intel Sandy Bridge and later   0-1
Intel Atom                     0
AMD                            2
VIA Nano                       2-3
Table 13.2. Data bypass delays between integer and floating point execution units
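For reference, a minimal intrinsics sketch of the kind of mixing the question asks about: loading the upper 64 bits of an integer vector via the float-typed movhps, and the interchangeable por/orps bitwise ORs. Whether this costs a bypass delay depends on the microarchitecture, per the table above.

    #include <xmmintrin.h>   /* SSE  */
    #include <emmintrin.h>   /* SSE2 */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t buf[4] = {1, 2, 3, 4};

        /* Low 64 bits via movq, then the high 64 bits via movhps --
         * there is no integer counterpart, so cast through the float type. */
        __m128i lo  = _mm_loadl_epi64((const __m128i *)buf);
        __m128  tmp = _mm_loadh_pi(_mm_castsi128_ps(lo), (const __m64 *)(buf + 2));
        __m128i v   = _mm_castps_si128(tmp);

        /* por vs orps: identical bitwise result, potentially different domains. */
        __m128i mask  = _mm_set1_epi32(0x40000000);
        __m128i r_int = _mm_or_si128(v, mask);                         /* por  */
        __m128i r_flt = _mm_castps_si128(
            _mm_or_ps(_mm_castsi128_ps(v), _mm_castsi128_ps(mask)));   /* orps */

        int32_t out[4];
        _mm_storeu_si128((__m128i *)out, r_int);
        printf("por:  %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
        _mm_storeu_si128((__m128i *)out, r_flt);
        printf("orps: %08x %08x %08x %08x\n", out[0], out[1], out[2], out[3]);
        return 0;
    }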

FLOPS Intel core and testing it with C (innerproduct)

I have some confusion about measuring flops. On the Intel architecture, is a FLOP one addition and one multiplication together? I read that somewhere online and couldn't find anything refuting it. I know that a FLOP can mean different things on different types of CPU.
How do I calculate my theoretical peak FLOPS? I am using an Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80 GHz. What exactly is the relationship between GHz and FLOPS? (Even Wikipedia's entry on FLOPS does not specify how to do this.)
I will be using the following method to measure the actual performance of my computer (in terms of flops): the inner product of two vectors. For two vectors of size n, is the number of flops 2n - 1 (n multiplications and n - 1 additions, if one addition or one multiplication is considered to be 1 flop)? If not, how should I go about calculating this?
I know there are better ways to do this, but I would like to know whether my proposed calculations are right. I read somewhere about LINPACK as a benchmark, but I would still like to know how it's done.
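For reference, a rough sketch of the measurement described above: time a dot product and divide the flop count by the elapsed time. This is only an illustration; a serious measurement needs vectorized code, warm caches and a better timer.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        const size_t n = 50 * 1000 * 1000;
        float *a = malloc(n * sizeof *a);
        float *b = malloc(n * sizeof *b);
        for (size_t i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        clock_t t0 = clock();
        double dot = 0.0;
        for (size_t i = 0; i < n; i++)
            dot += (double)a[i] * b[i];          /* n multiplies, n - 1 useful adds */
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        double flops = 2.0 * n - 1.0;            /* 2n - 1 flops for a dot product */
        printf("dot = %.0f, %.3f s, %.2f GFLOPS\n", dot, secs, flops / secs / 1e9);
        free(a);
        free(b);
        return 0;
    }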
As for your 2nd question, the theoretical FLOPS calculation isn't too hard. It can be broken down into roughly:
(Number of cores) * (Number of execution units / core) * (cycles / second) * (Execution unit operations / cycle) * (floats-per-register / Execution unit operation)
A Core 2 Duo has 2 cores and 1 execution unit per core. An SSE register is 128 bits wide, and a float is 32 bits wide, so you can store 4 floats per register. I assume the execution unit does 1 SSE operation per cycle. So it should be:
2 * 1 * 2.8 * 1 * 4 = 22.4 GFLOPS
which matches:
http://www.intel.com/support/processors/sb/cs-023143.htm
This number is obviously purely theoretical, best-case performance. Real-world performance will most likely not come close to this for a variety of reasons. It's probably not worth trying to directly correlate FLOPS to actual app runtime; you'd be better off trying out the computations used by your application.
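Written out as code, that back-of-the-envelope calculation looks like this; every factor is an assumption typed in by hand, not something queried from the CPU:

    #include <stdio.h>

    int main(void) {
        double cores          = 2;    /* Core 2 Duo E7400                 */
        double units_per_core = 1;    /* assumed SSE execution units/core */
        double ghz            = 2.8;  /* cycles per second (in billions)  */
        double ops_per_cycle  = 1;    /* assumed SSE operations per cycle */
        double floats_per_op  = 4;    /* 128-bit register / 32-bit floats */

        double gflops = cores * units_per_core * ghz * ops_per_cycle * floats_per_op;
        printf("Theoretical peak: %.1f GFLOPS\n", gflops);   /* 22.4 */
        return 0;
    }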
This article shows some theory on FLOPS numbers for x86 CPUs. It's only current up to Pentium 4, but perhaps you can extrapolate.
A FLOP stands for Floating Point Operation.
It means the same thing on any architecture that supports floating point operations, and it is usually measured as the number of such operations that can take place in one second (as in FLOPS: floating point operations per second).
Here you can find tools to measure your computer's FLOPS.
Intel's data sheets contain GFLOPS numbers, and your processor has a claimed 22.4:
http://www.intel.com/support/processors/sb/CS-023143.htm
Since your machine is dual-core, that means 11.2 GFLOPS per core at 2.8 GHz. Divide that out and you get 4, so Intel claims that each of their cores can do 4 FLOPs per cycle.
