SIMD micro-architecture - ARM

I'm trying to understand the difference between vector processor and SIMD architectures such as ARM NEON. I know that there is a difference in vector register length configurability between the two. However, I'm not sure how their microarchitectures can differ. Is it the case that for SIMD machines we need as many processing units as the number of elements each instruction operates on? Or, just like vector processors, can we have fewer processing units than the number of data elements in a vector register and just use a sequencer to complete an instruction in multiple cycles?
Thanks

You can implement short-vector SIMD (like NEON or x86 SSE) with narrower hardware that has to decode each instruction into 2 internal operations, for example.
Intel did this with 128-bit SSE vectors on the Pentium III through Pentium M, with Pentium 4 and Core 2 being the first microarchitectures to have full-width SIMD execution units.
But the decoding is not data-dependent, so you don't need a full microcode sequencer.

the difference between Vector Processor and SIMD
I don't know your definition of vector processor, but Wikipedia counts SIMD as one type of vector processing.
Is it the case that for SIMD machines we need to have as many processing units as the number of elements each instruction operate on?
Some CPUs split a SIMD register into parts and process them independently. Intel's Pentium III split 128-bit SSE operations into 64-bit pieces; AMD's Zen does the same with 256-bit AVX instructions, splitting them into 128-bit pieces.
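As a rough mental model of that splitting (only a sketch of the result, not what the hardware literally does; the function name is made up), a 256-bit AVX add is result-equivalent to two independent 128-bit adds on the halves:

    #include <immintrin.h>

    /* Result-equivalent sketch of handling a 256-bit add as two
       independent 128-bit halves. */
    __m256 add256_as_two_halves(__m256 a, __m256 b)
    {
        __m128 lo = _mm_add_ps(_mm256_castps256_ps128(a),
                               _mm256_castps256_ps128(b));
        __m128 hi = _mm_add_ps(_mm256_extractf128_ps(a, 1),
                               _mm256_extractf128_ps(b, 1));
        return _mm256_insertf128_ps(_mm256_castps128_ps256(lo), hi, 1);
    }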
need to use a sequencer to complete an instruction in multiple cycles?
Just because they're split doesn't mean they run sequentially. All modern CPUs, including ARM ones, have multiple execution units (EUs) per core. Micro-ops can run in parallel on different EUs, but these EUs aren't equal. Since I mentioned AMD Zen, take it as an example: the core can start executing up to 10 different micro-ops per cycle - 4 integer ops (all four can do adds or bitwise ops, 2 of them can multiply/divide, 2 of them can branch), 2 integer loads/stores, and 4 128-bit floating-point operations (two can add, the other two can multiply, two can do AES encryption). It can finish up to 16 instructions per cycle: 8 integer, 8 floating-point. Different micro-ops take a different number of cycles to complete.

Related

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will lead to the same (awesome) speedup that compiling it for AVX2 provides (VEX encoding of legacy instructions).
Does anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX encoding of the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with a VEX prefix. That means pshufb becomes vpshufb and works with other AVX code in ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX-512VL, in which case they gain the ability to directly address (renamed) zmm registers. It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has demonstrated similar throughput with half-width AVX2 (AVX-128) as with full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX-128 byte/word-size code running on AVX-512F?
You'll have to load 128-bit chunks into xmm, sign/zero-extend those bytes/words to 32-bit lanes in zmm, and operate as if they had always been the larger integers. Then, when finished, convert back to bytes/words.
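A minimal sketch of that widen-operate-narrow pattern, assuming plain AVX-512F only (the function name, the byte-add operation and the omitted tail handling are just for illustration):

    #include <stddef.h>
    #include <stdint.h>
    #include <immintrin.h>

    /* Add two byte arrays 16 bytes at a time by widening to 32-bit lanes
       in zmm, then truncating back on the way out (no tail handling). */
    void add_bytes_avx512f(const uint8_t *a, const uint8_t *b,
                           uint8_t *dst, size_t n)
    {
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            __m512i wa = _mm512_cvtepu8_epi32(va);   /* zero-extend bytes -> dwords */
            __m512i wb = _mm512_cvtepu8_epi32(vb);
            __m512i ws = _mm512_add_epi32(wa, wb);   /* operate on 32-bit lanes */
            __m128i r  = _mm512_cvtepi32_epi8(ws);   /* truncate dwords -> bytes */
            _mm_storeu_si128((__m128i *)(dst + i), r);
        }
    }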
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or across lanes only with the 32-bit vpermd, which is inapplicable to byte/word elements). Combined with free type conversions, the ability to use masks implicitly with AVX-512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc.), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX-512F after all. At least on KNL.
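For example (a hedged sketch; lookup32 is a made-up name), a 32-entry table of 32-bit values spread across two zmm registers can be indexed in a single two-source permute:

    #include <immintrin.h>

    /* 16 parallel lookups into a 32-entry dword table held in two zmm
       registers; idx supplies a 5-bit index per lane (vpermt2d/vpermi2d). */
    __m512i lookup32(__m512i idx, __m512i table_lo, __m512i table_hi)
    {
        return _mm512_permutex2var_epi32(table_lo, idx, table_hi);
    }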
Don't worry, I'll test the moment I get my hands on mine. ;-)

Is __int128_t arithmetic emulated by GCC, even with SSE?

I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2 through AVX instructions are available for 8-, 16-, 32- and 64-bit integer data types. They are mostly intended to treat packed data; for example, a 128-bit register may contain four 32-bit integers, and so on.
Although SSE/AVX/AVX-512/etc. have no 128-bit arithmetic mode (their vector elements are strictly 64-bit max, and operations will simply overflow within an element), as Paul R has implied, the main CPU does support limited 128-bit operations by using a pair of registers.
When multiplying two regular 64-bit numbers, MUL/IMUL can output its 128-bit result in the RDX:RAX register pair.
Inversely, when dividing, DIV/IDIV can take its input from the RDX:RAX pair to divide a 128-bit number by a 64-bit divisor (and outputs a 64-bit quotient plus a 64-bit remainder).
Of course the CPU's ALU is 64-bit wide, so - as the Intel docs imply - the extra upper 64 bits come at the cost of extra micro-ops in the microcode. This is more dramatic for divisions (more than 3x more), which already require a lot of micro-ops.
Still, that means that under some circumstances (like using a rule of three to scale a value), it's possible for a compiler to emit regular CPU instructions and not have to do any 128-bit emulation by itself.
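For instance, a 64x64-to-128-bit multiply written with GCC's unsigned __int128 normally compiles down to a single MUL, with the result left in RDX:RAX and no emulation routine involved (a minimal example, not taken from any particular codebase):

    #include <stdint.h>

    /* GCC normally turns this into one 64x64->128-bit MUL instruction. */
    unsigned __int128 mul_64x64_to_128(uint64_t a, uint64_t b)
    {
        return (unsigned __int128)a * b;
    }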
This has been available for a long time:
since the 80386, 32-bit CPUs could do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs could do 32-bit multiplication/division using the DX:AX pair
(As for additions and subtractions: thanks to carry support, it's completely trivial to do additions/subtractions of numbers of any arbitrary length that can fit in your storage.)
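Likewise, a full 128-bit addition is just an ADD/ADC pair (again only a minimal sketch):

    /* GCC emits add + adc for this; no library call is involved. */
    unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
    {
        return a + b;
    }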

Why does ARM have 16 registers?

Why does ARM have only 16 registers? Is that the ideal number?
Does the distance to the registers, when there are more registers, also increase the processing time/power?
As the number of general-purpose registers becomes smaller, you need to start using the stack for variables. Using the stack requires more instructions, so code size increases. Using the stack also increases the number of memory accesses, which hurts both performance and power usage. The trade-off is that to address more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements. You can see how differing register counts affect code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:
Extendable Instruction Set Computing
Register   Program   Load/Store
Count      Size      Frequency
27         100.00    27.90%
16         101.62    30.22%
 8         114.76    44.45%
(They used 27 as a base because that is the number of GPRs available on a MIPS processor)
As you can see, both program size and the number of loads/stores required grow only marginally as you drop the register count from 27 down to 16. The real penalties don't kick in until you drop down to 8 registers. I suspect ARM's designers felt that 16 registers was a kind of sweet spot when looking for the best performance per watt.
To choose one of 16 registers you need 4 bits, so it could be that this is the best match for the opcodes (machine instructions); otherwise you would have to introduce a more complex instruction set, which would lead to a bigger decoder and thus additional cost (execution time).
Wikipedia says the ISA has a "fixed instruction width of 32 bits to ease decoding and pipelining", so it is a reasonable trade-off.
32-bit ARM has 16 registers because it only uses 4 bits for encoding the register number, not because 16 is the ideal number. Likewise, x86 has only 8 registers because historically it used 3 bits to encode the register so that some instructions fit in a byte.
That's such a limited number that both x86 and ARM doubled the count when going to 64-bit, to 16 and 32 registers respectively. The old ARM instruction encoding had no spare bits left for the larger register numbers, so a trade-off had to be made by dropping the ability to execute almost every instruction conditionally and reusing the 4-bit condition field for new features (that's an oversimplification; in reality the encoding is new and not exactly like that, but you do need 3 more bits for the new registers).
Back in the 80's (IIRC) an academic paper was published that examined a number of different workloads, comparing expected performance benefits of different numbers of registers. This was at a time when RISC processors were transitioning from academic ideas to mainstream hardware, and it was important to decide what was optimal. CPUs were already pulling ahead of memory in speed, and RISC was making this worse by limiting addressing modes and having separate load and store instructions. Having more registers meant you could "cache" more data for immediate access and therefore access main memory less.
Considering only powers of two, it was found that 32 registers was optimal, although 16 wasn't terribly far behind.
ARM is unusual in that almost every instruction can carry a conditional execution code, avoiding separate tests & branches. Don't forget, many 32-register machines fix R0 to 0, so conditional tests are done by comparing against R0. I know from experience. 20 years ago I had to program a 'Mode 7' (from SNES terminology) floor. The CPUs were the SH2 for the 32X (or rather two of them), the MIPS R3000 (PlayStation) and the 3DO's ARM; the inner loops of the code were 19, 15 & 11 instructions respectively. If the 3DO had been running at the same speed as the other two, it would have been twice as fast. As it was, it was just a bit slower.

Do I get a performance penalty when mixing SSE integer/float SIMD instructions

I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but in theory should perform equally for both. For example, both float and double vectors have instructions to load the upper 64 bits of a 128-bit vector from an address (movhps, movhpd), but there's no such instruction for integer vectors.
My question:
Is there any reason to expect a performance hit when using floating-point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
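(For concreteness, this is the kind of construct I mean - a sketch that reinterprets through the double domain, where the casts themselves generate no instructions:)

    #include <immintrin.h>

    /* Fill the upper 64 bits of an "integer" vector with movhpd by
       viewing it as a double vector just for the load. */
    __m128i load_high64_into_int(__m128i lo, const void *p)
    {
        __m128d d = _mm_loadh_pd(_mm_castsi128_pd(lo), (const double *)p);
        return _mm_castpd_si128(d);
    }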
I wrote several tests to check that, but I suppose their results are not credible. It's really hard to write a correct test that explores all corner cases for such things, especially when the instruction scheduling is most probably involved here.
Related question:
Other trivially similar operations also have several instructions that do basically the same thing. For example, I can do bitwise OR with por, orps or orpd. Can anyone explain the purpose of these additional instructions? I guess this might be related to different scheduling algorithms applied to each instruction.
From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:
There is a penalty for using the wrong type of instructions on some processors. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units can take one or more clock cycles depending on the processor, as listed in table 13.2.
Processor                      Bypass delay, clock cycles
Intel Core 2 and earlier       1
Intel Nehalem                  2
Intel Sandy Bridge and later   0-1
Intel Atom                     0
AMD                            2
VIA Nano                       2-3

Table 13.2. Data bypass delays between integer and floating point execution units
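For reference, the three "trivially similar" ORs from the question as intrinsics: they compute identical bits but are nominally typed for the integer, single-precision and double-precision domains, which is exactly where the bypass delays above can bite (a trivial sketch, not from Agner's text):

    #include <immintrin.h>

    /* Same 128 bits of work, three instruction domains. */
    __m128i or_int(__m128i a, __m128i b) { return _mm_or_si128(a, b); } /* por  */
    __m128  or_ps (__m128  a, __m128  b) { return _mm_or_ps(a, b);    } /* orps */
    __m128d or_pd (__m128d a, __m128d b) { return _mm_or_pd(a, b);    } /* orpd */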

C - the limits of speed of the Desktop-CPUs if program is build using GCC with all optimization flags?

We are planning to port a big part of our Digital Signal Processing routines from hardware-specific chips to the common desktop CPU architecture, like a quad-core or so. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-CPU bandwidth [Gb/sec] and in a high number of 32-bit IEEE-754 floating-point multiply-accumulate operations per second.
I have selected a typical representative of modern desktop CPUs: quad-core, about 10 MB cache, 3 GHz, 45 nm. Can you please help me find out its limits?
1) Highest possible number of multiply-accumulate operations per second, if CPU-specific instructions that GCC supports via flags are used and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture like AltiVec on PowerPC - the best option is to use GCC flags like -msse or -maltivec. I assume the program also has to have 4 threads in order to utilize all available cores, right?
2) SDRAM-CPU bandwidth (the highest limit, so independent of the mainboard).
UPDATE: Since GCC 3, GCC can automatically generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4. SSE4.1 introduces the DPPS and DPPD instructions - dot products for array-of-structs data. New 45 nm Intel processors support SSE4 instructions.
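As a hypothetical example of the kind of loop in question (not our actual DSP code), something this simple is what GCC 4+ can in principle auto-vectorize at -O3 with the appropriate -m flags:

    /* One multiply-accumulate per sample; simple enough for the
       auto-vectorizer to turn into packed SSE code. */
    void scale_accumulate(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] += k * src[i];
    }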
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles and a throughput of only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies and 1.5 adds per cycle, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
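To illustrate the "stay within vector lanes" advice (a sketch with a made-up function name, assuming n is a multiple of 4 and SSE3 for the final shuffle): accumulate partial products vertically with MULPS/ADDPS and do one horizontal reduction at the very end, instead of issuing a horizontal DPPS per group of four elements.

    #include <stddef.h>
    #include <immintrin.h>

    /* Vertical multiply + add per iteration; one horizontal sum at the end. */
    float dot_vertical(const float *a, const float *b, size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
        }
        __m128 shuf = _mm_movehdup_ps(acc);    /* duplicate odd lanes   */
        __m128 sums = _mm_add_ps(acc, shuf);
        shuf = _mm_movehl_ps(shuf, sums);      /* high half -> low half */
        sums = _mm_add_ss(sums, shuf);
        return _mm_cvtss_f32(sums);
    }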
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPU sockets. Only with 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (though when running on AMD CPUs you have to patch away some CPU checks Intel puts in to prevent AMD from looking good).
3) No x86 CPU supports multiply-and-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, and on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.
