Do multicore and coprocessor mean the same thing? - arm

I have a doubt about the meaning of ARM Cortex-A15.
My understanding is that it has one processor (CPU) with 15 cores and uses the ARMv7 architecture. Please correct me if this understanding is not correct.
Do multicore and coprocessor mean the same thing or something different? Can you help me understand the difference?

A15 is just an ARM processor (CPU) model number. It comes with 1-4 cores and is based on ARMv7-A.
Co-processors are compute units that help the ARM core (or any other processor, for that matter) do its operations more efficiently.
Multi-core means there is more than one core in the CPU.
E.g. Versatile Express is a two-cluster system built from a multicore 2x Cortex-A15 cluster and a 3x Cortex-A7 cluster.
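To make the distinction concrete, here is a minimal sketch (assuming an ARMv7-A target and privileged execution, e.g. bare metal or a kernel module) that touches both ideas at once: MRC is a read from coprocessor 15, the system control coprocessor, and the MPIDR value it returns tells you which core of a multicore cluster you are running on.

#include <stdint.h>

/* Sketch only: ARMv7-A, must run at PL1 (bare metal or kernel code),
 * because CP15 - the "system control coprocessor" - is generally not
 * accessible from user space.                                          */
static inline uint32_t read_mpidr(void)
{
    uint32_t mpidr;
    /* MRC p15, 0, <Rt>, c0, c0, 5 reads the Multiprocessor Affinity
     * Register through coprocessor 15.                                 */
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 5" : "=r"(mpidr));
    return mpidr;
}

static inline uint32_t current_core(void)
{
    /* Affinity level 0 (bits [7:0]) identifies the core within its
     * cluster - this is the "multicore" part of the picture.           */
    return read_mpidr() & 0xFF;
}

So "coprocessor" here is an interface the CPU talks to with dedicated instructions (MRC/MCR), while "multicore" is simply about how many cores share the cluster.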

Related

Explanation of the arm cortex a/r/m numbering convention

I've been looking around some web sources, but I cannot find the meaning of the numbers after the processor type in the ARM family. For example Cortex-A53: I know the A refers to the application family, and the 5 might indicate that it contains an MMU (not sure though), but the 3 I have no idea about... Can you please provide an explanation or sources?
For the Cortex-A processors there are three major sub-groups which are worth knowing about:
Cortex-A3x => smaller cores, mostly designed for embedded systems and low-cost mobile.
Cortex-A5x => "LITTLE" cores in the Arm big.LITTLE / DynamIQ heterogeneous compute architecture (so lower peak performance than the "big" cores, but better energy efficiency).
Cortex-A7x => "big" cores in the Arm big.LITTLE / DynamIQ heterogeneous compute architecture (so higher peak performance than the "LITTLE" cores, but lower energy efficiency).
Within each of those groups, a bigger value of "x" indicates a newer CPU core, which nearly always has both better energy efficiency and higher peak performance than the lower-numbered cores in that group.
The specific numbers don't encode anything like "has an MMU" (unless you go back a long time - some of the early ARM7 and ARM9 CPU names did).
Cortex-M and Cortex-R don't really have the same tiers - in general a bigger number means a bigger, faster core with more recent ISA extensions that add new capabilities.
The only significant banding that exists is the Cortex-R5x series (which is the Armv8-R architecture, including 64-bit support, whereas the single-digit R cores are all 32-bit Armv7 cores).
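As a side note, the marketing name is not something the hardware reports directly: what you can read back (e.g. the "CPU part" field in /proc/cpuinfo on Linux, taken from the MIDR register) is a part number. Below is a small illustrative sketch decoding a few widely documented part IDs; the list is deliberately incomplete and the values are assumptions you should check against Arm's documentation.

#include <stdio.h>
#include <stdint.h>

/* Illustrative only: a handful of widely documented MIDR part numbers.
 * Note the tier (A3x/A5x/A7x) is a marketing grouping and is not
 * encoded in the part number in any systematic way.                    */
static const char *cortex_name(uint16_t partnum)
{
    switch (partnum) {
    case 0xC07: return "Cortex-A7";
    case 0xC09: return "Cortex-A9";
    case 0xC0F: return "Cortex-A15";
    case 0xD03: return "Cortex-A53";
    case 0xD07: return "Cortex-A57";
    default:    return "unknown";
    }
}

int main(void)
{
    printf("CPU part 0xd03 -> %s\n", cortex_name(0xD03));
    return 0;
}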

Would a C6000 DSP be outperformed by a Cortex A9 for FP

I'm using an OMAP-L138 processor at the moment, which does not have a hardware FPU. We will be processing spectral data using algorithms that are FP intensive, so the ARM side won't be adequate. I'm not the algorithm person, but one of them is "Dynamic Time Warping" (I don't know what that means, no). The initial performance numbers are:
Core i7 laptop @ 2.9 GHz: 1 second
Raspberry Pi ARM1176 @ 700 MHz: 12 seconds
OMAP-L138 ARM926 @ 300 MHz: 193 seconds
Worse, the Pi is about 30% of the price of the board I'm using!
I do have a TI C674x, which is the other processor in the OMAP-L138. The question is whether I would be best served by spending many weeks trying to:
learn DSPLINK, the interop libraries and the toolchain, not to mention forking out the large cost of Code Composer, or
throwing the L138 out and moving to a dual Cortex-A9 like the Pandaboard, possibly suffering power penalties in the process.
(When I look at FPU performance on the A8, it isn't an improvement over the Raspberry Pi, but the Cortex-A9 seems to be.)
I understand the answer is "it depends". Others here have said that "you unlock an incredible fast DSP that can easily outperform the Cortex-A8 if assigned the right job", but for a defined job set, would I be better off skipping to the A9, even if I had to buy an external DSP later?
That question can't be answered without knowing the clock rates of the DSP and the ARM.
Here is some background:
I just checked the cycle counts of a floating point multiplication on the C674x DSP:
It can issue two multiplications per cycle, and each multiplication has a result latency of three cycles (that means you have to wait three additional cycles before the result appears in the destination register).
You can however start two multiplications each cycle because the DSP will not wait for the result. The compiler/assembler will do the required scheduling for you.
That only uses two of the eight functional units available on the DSP, so while you do the two multiplications you can, in the same cycle, also do:
two loads/stores (64 bits wide)
four floating point add/subtract instructions (or integer instructions)
Loop control and branching are free and do not cost you anything on the DSP.
That makes a total of six floating point operations per cycle with parallel loads/stores and loop control.
ARM-NEON on the other hand can, in floating point mode:
Issue two multiplications per cycle. Latency is comparable, and the instructions can be pipelined just as on the DSP. Loads/stores take extra time, as do adds/subtracts. Loop control and branching will very likely be free in well-written code.
So in summary the DSP does three times as much work per cycle as the Cortex-A9 NEON unit.
Now you can check the clock rates of the DSP and the ARM and see which is faster for your job.
Oh, one thing: with well-written DSP code you will almost never see a cache miss during loads, because you move the data from RAM into the cache using DMA before you access it. This gives the DSP an impressive speed advantage as well.
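To turn the per-cycle numbers above into a rough comparison once you know the clocks, here is a minimal sketch; the 6 and 2 flops/cycle figures come straight from the reasoning above, while the clock rates are placeholders (the C674x in an OMAP-L138 and a notional 1 GHz Cortex-A9) that you should replace with your actual parts.

#include <stdio.h>

/* Back-of-the-envelope peak throughput: flops-per-cycle x clock rate.
 * 6 flops/cycle (2 MUL + 4 ADD/SUB) and 2 flops/cycle are the per-cycle
 * figures from the discussion above; the clock rates are assumptions.  */
int main(void)
{
    double dsp_clock_hz = 300e6;   /* assumed C674x clock - adjust to yours */
    double a9_clock_hz  = 1.0e9;   /* assumed Cortex-A9 clock - adjust      */

    printf("C674x peak  : %.1f GFLOP/s\n", 6.0 * dsp_clock_hz / 1e9);
    printf("A9 NEON peak: %.1f GFLOP/s\n", 2.0 * a9_clock_hz  / 1e9);
    return 0;
}

Real code will land well below either peak, but the ratio gives you a first-order answer to "which part is faster for this job".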
It does depend on the application but, generally speaking, it is rare these days for special-purpose processors to beat general-purpose processors. General-purpose processors now have higher clock rates and multimedia acceleration. Even for a numerically intensive algorithm where a DSP may have an edge, the increased engineering complexity of dealing with a heterogeneous multi-processor environment makes this type of solution problematic from an ROI perspective.

Which are the different variable cycle ARM instructions?

I was reading this book "ARM System Developers Guide" by Elsevier and I came across this:
The ARM instruction set differs from the pure RISC definition in several ways that make
the ARM instruction set suitable for embedded applications:
Variable cycle execution for certain instructions — Not every ARM instruction executes in a single cycle. For example, load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The
transfer can occur on sequential memory addresses, which increases performance since
sequential memory accesses are often faster than random accesses. Code density is also
improved since multiple register transfers are common operations at the start and end
of functions.
Any other ARM instructions you guys can point out which take variable cycles to execute?
Cycle timings are micro-architecture dependent, so you need to check the particular implementation's technical reference manual (TRM). For the Cortex-A9, for example, the situation is described as quite complicated:
The complexity of the Cortex-A9 processor makes it impossible to calculate precise timing information manually. The timing of an instruction is often affected by other concurrent instructions, memory system activity, and additional events outside the instruction flow.
However, the same document does give precise timings for data-processing, load and store, and multiply instructions, plus some information about branch and serializing instructions.
For example, from that document you can see that if shifting is involved, an instruction may take 1-2 cycles more depending on the shift source, which may be a constant embedded in the instruction or a value read from a register.
Also, besides the book's note that load-store-multiple instructions vary with the number of registers involved, they also vary depending on whether the address is aligned or not.
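To make this concrete, here is a small sketch of two instruction forms whose timing typically varies; it uses GCC inline assembly, so it only builds for a 32-bit ARM target in ARM state (e.g. with -marm), and the exact cycle counts still depend on the particular core, as the TRM quote above warns.

/* Sketch: builds only for a 32-bit ARM target in ARM state (-marm).
 * The point is the instruction forms, not exact cycle counts.          */
void variable_cycle_examples(int *p, int a, int b, int s)
{
    int r;

    /* Data-processing with a register-specified shift: on many cores
     * this costs extra compared with a constant shift amount.          */
    __asm__ volatile("add %0, %1, %2, lsl %3"
                     : "=r"(r)
                     : "r"(a), "r"(b), "r"(s));

    /* Load-multiple: execution time grows with the number of registers
     * in the list, and can also depend on address alignment.           */
    __asm__ volatile("ldmia %0, {r4-r6}"
                     :
                     : "r"(p)
                     : "r4", "r5", "r6", "memory");

    (void)r;
}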

Do I get a performance penalty when mixing SSE integer/float SIMD instructions

I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I find frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but which in theory should perform equally well for both. For example, both float and double vectors have instructions to load the upper 64 bits of a 128-bit vector from an address (movhps, movhpd), but there's no such instruction for integer vectors.
My question:
Is there any reason to expect a performance hit when using floating point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
I wrote several tests to check this, but I suppose their results are not credible. It's really hard to write a correct test that explores all the corner cases for such things, especially when instruction scheduling is most probably involved.
Related question:
Other trivially similar operations also have several instructions that do basically the same thing. For example, I can do a bitwise OR with por, orps or orpd. Can anyone explain the purpose of these additional instructions? I guess this might be related to different scheduling algorithms applied to each instruction.
From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:
There is a penalty for using the wrong type of instructions on some processors. This is
because the processor may have different data buses or different execution units for integer
and floating point data. Moving data between the integer and floating point units can take
one or more clock cycles depending on the processor, as listed in table 13.2.
Processor Bypass delay, clock cycles
Intel Core 2 and earlier 1
Intel Nehalem 2
Intel Sandy Bridge and later 0-1
Intel Atom 0
AMD 2
VIA Nano 2-3
Table 13.2. Data bypass delays between integer and floating point execution units
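As a concrete illustration of the question's por/orps example, here is a hedged intrinsics sketch of the two ways to OR integer data; the results are bit-identical, and whether the float-domain version actually pays a bypass delay depends on the micro-architecture, per the table above.

#include <emmintrin.h>  /* SSE2 */

/* Integer-domain OR: compiles to por. */
__m128i or_int_domain(__m128i a, __m128i b)
{
    return _mm_or_si128(a, b);
}

/* Float-domain OR on integer data: compiles to orps. The casts only
 * reinterpret the bits; any cost comes from crossing the int/float
 * execution domains, as described in Agner Fog's table above.          */
__m128i or_float_domain(__m128i a, __m128i b)
{
    __m128 fa = _mm_castsi128_ps(a);
    __m128 fb = _mm_castsi128_ps(b);
    return _mm_castps_si128(_mm_or_ps(fa, fb));
}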

C - what are the speed limits of desktop CPUs if a program is built using GCC with all optimization flags?

We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture, such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating point multiply-accumulate operations per second.
I have selected a typical representative of modern desktop CPUs: quad core, about 10 MB cache, 3 GHz, 45 nm. Can you please help me find out its limits?
1) The highest possible number of multiply-accumulate operations per second, if the CPU-specific instructions that GCC supports via input flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture such as AltiVec on PowerPC - the best option is to use GCC flags like -msse or -maltivec. I also assume the program has to have 4 threads in order to utilize all available cores, right?
2) SDRAM-CPU bandwidth (the highest limit, i.e. independent of the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 has been available since GCC 4. SSE4.1 introduces the DPPS and DPPD instructions - dot products for array-of-structs data. New 45 nm Intel processors support SSE4 instructions.
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
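For reference, the kind of kernel this estimate assumes is a plain multiply-accumulate over contiguous arrays; a sketch like the one below stays portable C, and with flags such as -O3 -ftree-vectorize plus -msse3 (x86) or -maltivec (PowerPC), a loop this simple has a reasonable chance of being auto-vectorized into packed multiplies and adds.

#include <stddef.h>

/* Portable scalar source (C99 restrict tells the compiler the arrays
 * don't alias). Each iteration is independent, so the auto-vectorizer
 * can turn this into packed mulps/addps (or AltiVec equivalents).      */
void mac(float *restrict acc, const float *restrict a,
         const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        acc[i] += a[i] * b[i];
}

Run it with 4 threads, one per core and each on its own slice of the arrays, to approach the per-socket figure above.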
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
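To illustrate the trade-off with intrinsics, here is a hedged sketch of both approaches; _mm_dp_ps is the DPPS intrinsic (the 0xF1 mask multiplies all four lanes and places the sum in lane 0), while the second version keeps the work in vertical multiplies and adds and only does a horizontal reduction once at the end, which is the layout advice above.

#include <smmintrin.h>  /* SSE4.1 (_mm_dp_ps); also pulls in SSE/SSE2/SSE3 */

/* One 4-element dot product via the horizontal DPPS instruction. */
float dot4_dpps(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

/* Dot product over n elements (n assumed to be a multiple of 4) using
 * vertical multiplies and adds, reducing horizontally only at the end. */
float dot_vertical(const float *a, const float *b, int n)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i),
                                         _mm_loadu_ps(b + i)));
    acc = _mm_hadd_ps(acc, acc);   /* SSE3 horizontal adds for the final sum */
    acc = _mm_hadd_ps(acc, acc);
    return _mm_cvtss_f32(acc);
}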
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (though when used on AMD CPUs you have to patch away some CPU checks Intel puts in to stop AMD from looking good).
3) No x86 CPU supports a fused multiply-add instruction yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.
