The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but only the foundation subset "F" (roughly analogous to SSE without SSE2, or AVX without AVX2): mainly 32- and 64-bit floating-point and integer operations, with no byte/word instructions.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will give the same (awesome) speedup that compiling it for AVX2 provides (via VEX encoding of the legacy instructions).
Does anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like the one AVX1's VEX encoding gave the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with the VEX prefix. That means pshufb becomes vpshufb and mixes freely with other AVX code in ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding, and will not be able to address the new registers. Legacy-width (128/256-bit) instructions only get EVEX forms with AVX-512VL (plus AVX-512BW for the byte/word instructions), in which case they gain access to registers 16-31 (xmm16-31/ymm16-31, the lower parts of zmm16-31). It is unknown whether register sharing is possible at this point, but pipelined half-width AVX2 (AVX-128) code has demonstrated throughput similar to full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into xmm, sign- or zero-extend those bytes/words into 32-bit elements in zmm, operate on them as if they had always been the larger integers, and then convert back to bytes/words when finished.
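A minimal sketch of that pattern, assuming only AVX-512F and the intrinsic names current compilers expose (the helper name and the constant-add are purely illustrative):

#include <immintrin.h>

/* Minimal sketch, assuming only AVX-512F: add a constant to 16 bytes by
   widening them to 32-bit lanes in a zmm register and narrowing back at
   the end. */
static inline __m128i add_bytes_avx512f(__m128i v, int k)
{
    __m512i wide = _mm512_cvtepu8_epi32(v);              /* zero-extend 16 x u8 -> 16 x i32 (VPMOVZXBD) */
    wide = _mm512_add_epi32(wide, _mm512_set1_epi32(k)); /* do the arithmetic on 32-bit lanes */
    return _mm512_cvtepi32_epi8(wide);                   /* truncate back to 16 bytes (VPMOVDB) */
}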
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput (see the sketch below). This is not possible even with AVX2, which can mostly only permute within 128-bit lanes (the 32-bit vpermd does cross lanes, but doesn't help byte/word data). Combined with free type conversions, the ability to use masks implicitly with AVX-512 (sparing the costly and register-hungry use of blendv or explicit mask generation), and the presence of more comparison operators (native NOT, unsigned/signed lt/gt, etc.), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX-512F after all. At least on KNL.
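For illustration, a 32-way 32-bit table lookup could look something like this (a sketch; the function and variable names are mine):

#include <immintrin.h>

/* Sketch: a 32-entry, 32-bit lookup table held in two zmm registers
   (tab_lo = entries 0-15, tab_hi = entries 16-31). Each index uses 5 bits;
   one VPERMT2D/VPERMI2D gathers 16 results at once. */
static inline __m512i lut32_lookup(__m512i idx, __m512i tab_lo, __m512i tab_hi)
{
    return _mm512_permutex2var_epi32(tab_lo, idx, tab_hi);
}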
Don't worry, I'll test the moment I get my hands on mine. ;-)
I am somehow confused by the MOVSD assembly instruction. I wrote some numerical code computing some matrix multiplication, simply using ordinary C code with no SSE intrinsics. I do not even include the header file for SSE2 intrinsics for compilation. But when I check the assembler output, I see that:
1) 128-bit vector registers XMM are used;
2) SSE2 instruction MOVSD is invoked.
I understand that MOVSD essentially operates on a single double precision floating point value. It only uses the lower 64 bits of an XMM register and sets the upper 64 bits to 0. But I just don't understand two things:
1) I never gave the compiler any hint to use SSE2. Plus, I am using GCC, not the Intel compiler. As far as I know, the Intel compiler will automatically seek opportunities for vectorization, but GCC will not. So how does GCC know to use MOVSD? Or has this x86 instruction been around long before the SSE instruction set, and is the _mm_load_sd() intrinsic in SSE2 just there to provide backward compatibility for using XMM registers for scalar computation?
2) Why doesn't the compiler use other floating point registers, either the 80-bit floating point stack or 64-bit floating point registers? Why must it pay the toll of using an XMM register (setting the upper 64 bits to 0 and essentially wasting that storage)? Do XMM registers provide faster access?
By the way, I have another question regarding SSE2. I just can't see the difference between _mm_store_sd() and _mm_storel_sd(). Both store the lower 64-bit value to an address. What is the difference? A performance difference? An alignment difference?
Thank you.
Update 1:
OKAY, obviously when I first asked this question I lacked some basic knowledge of how a CPU handles floating point operations, so experts tended to think my question was nonsense. Since I did not include even the shortest sample of C code, people might have found the question vague as well. Here I will provide a review as an answer, which will hopefully be useful to anyone unclear about floating point operations on modern CPUs.
A review of floating point scalar/vector processing on modern CPUs
The idea of vector processing dates back to the old vector supercomputers, but those machines have been superseded by modern cache-based architectures. So we focus on modern CPUs, especially x86 and x86-64, since these architectures are the mainstream in high performance scientific computing.
Starting with the 8087 coprocessor (later integrated on-die), Intel provided the floating point stack, where floating point numbers up to 80 bits wide can be held. This stack is commonly known as the x87 or 387 floating point "registers", with its own set of x87 FPU instructions. The x87 registers are not real, directly addressable registers like the general purpose registers: they form a stack. Register st(i) is accessed as an offset from the stack top register %st(0), or simply %st. With the help of the FXCH instruction, which swaps the contents of the current stack top %st and some offset register %st(i), random access can be achieved, although FXCH can impose some performance penalty, even if it is minimized on modern implementations. The x87 stack provides high precision computation by calculating intermediate results with 80 bits of precision by default, to minimise roundoff error in numerically unstable algorithms. However, x87 instructions are completely scalar.
The first effort at vectorization was the MMX instruction set, which implemented integer vector operations. The MMX vector registers are the 64-bit wide registers MM0, MM1, ..., MM7. Each can hold either one 64-bit integer or multiple smaller integers in a "packed" format. A single instruction can then be applied to two 32-bit integers, four 16-bit integers, or eight 8-bit integers at once. So now the legacy general purpose registers handle scalar integer operations, while the new MMX registers handle integer vector operations. However, MMX shares state with the scalar x87 FPU: each MMX register corresponds to the lower 64 bits of an x87 register, and the upper 16 bits of the x87 registers are unused. The MMX registers are each directly addressable, but the aliasing made it difficult to work with floating point and integer vector operations in the same application. To maximize performance, programmers often used the processor exclusively in one mode or the other, deferring the relatively slow switch between them as long as possible.
Later, SSE created a separate set of 128-bit wide registers, XMM0-XMM7, alongside the x87 stack. SSE instructions focused exclusively on single-precision (32-bit) floating-point operations; integer vector operations were still performed using the MMX registers and the MMX instruction set. But now both kinds of operation can proceed at the same time, as they no longer share registers. It is important to know that SSE does not only do floating point vector operations, but also floating point scalar operations. Essentially it provides a new place where floating point operations can take place, and the x87 stack is no longer the preferred place for them. Using XMM registers for scalar floating point operations is faster than using the x87 stack, as all XMM registers are directly addressable, while the x87 stack can't be randomly accessed without FXCH. When I posted my question, I was clearly unaware of this fact. The other concept I was not clear about is that the general purpose registers are integer/address registers. Even though they are 64-bit on x86-64, they cannot hold 64-bit floating point values. The main reason is that the execution unit associated with the general purpose registers is the ALU (arithmetic & logic unit), which does not do floating point computation.
SSE2 was a major step forward, as it extended the vector data types so that SSE2 instructions, whether scalar or vector, can work with all the standard C data types. This extension in fact made MMX obsolete. Also, the x87 stack is no longer as important as it once was. Since there are two alternative places where floating point operations can take place, you can tell the compiler which one to use. For example for GCC, compiling with the flag
-mfpmath=387
will schedule floating point operations on the legacy x87 stack. Note that this seems to be the default for 32-bit x86, even if SSE is available. For example, I have an Intel Core 2 Duo laptop made in 2007 that already supports SSE up to SSE4, yet GCC still defaults to the x87 stack, which makes scientific computations unnecessarily slow. In this case, we need to compile with the flag
-mfpmath=sse
and GCC will schedule scalar floating point operations on XMM registers. x86-64 users need not worry about this configuration, as it is the default on x86-64. Note that this flag only affects scalar floating point operations. If we have written code using vector intrinsics and compile it with the flag
-msse2
then the compiler is allowed to emit SSE2 instructions for those intrinsics, and the vector computation takes place in XMM registers. Note, however, that -msse2 by itself does not move scalar floating point math off the x87 stack on 32-bit x86; for that you still need -mfpmath=sse (which is the default on x86-64). For more information see GCC's configuration of x86, x86-64. For examples of writing SSE2 C code, see my other post How to ask GCC to completely unroll this loop (i.e., peel this loop)?.
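A quick way to see the difference is to compile a one-line function and inspect the assembly (a sketch; the file name is mine):

/* double_add.c */
double add(double a, double b)
{
    return a + b;
}

/* gcc -O2 -m32 -S double_add.c                       -> x87:  fldl / faddl   */
/* gcc -O2 -m32 -msse2 -mfpmath=sse -S double_add.c   -> SSE2: addsd %xmm...  */
/* gcc -O2 -S double_add.c          (on x86-64)       -> SSE2 by default      */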
The SSE family of instructions, though very useful, is not the latest vector extension. AVX, the Advanced Vector Extensions, enhances SSE by widening the registers to 256 bits and by providing non-destructive three-operand (and a few four-operand) instruction forms. See number of operands in instruction set if you are unclear about what this means. Three-operand forms help the commonly seen multiply-add pattern in scientific computing by 1) using one fewer register; 2) reducing the amount of explicit data movement between registers; 3) together with the separate FMA extension, speeding up the fused multiply-add (FMA) itself. For an example of using AVX, see #Nominal Animal's answer to my post.
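As a sketch of what the three-operand form buys you, here is y = a*x + y written with AVX intrinsics, assuming a CPU that also has the separate FMA extension (compile with something like -mavx -mfma; the function name is mine):

#include <stddef.h>
#include <immintrin.h>

void axpy(double *restrict y, const double *restrict x, double a, size_t n)
{
    __m256d va = _mm256_set1_pd(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);
        __m256d vy = _mm256_loadu_pd(y + i);
        /* one non-destructive three-operand instruction: vy = va*vx + vy */
        _mm256_storeu_pd(y + i, _mm256_fmadd_pd(va, vx, vy));
    }
    for (; i < n; ++i)        /* scalar tail */
        y[i] = a * x[i] + y[i];
}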
I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2 through AVX provide instructions for 8-, 16-, 32- and 64-bit integer data types. They are mostly intended to operate on packed data; for example, a 128-bit register may contain four 32-bit integers, and so on.
Although SSE/AVX/AVX-512/etc. have no 128-bit integer mode (their vector elements are at most 64 bits wide, and wider operations simply wrap around), as Paul R has implied, the main CPU does support limited 128-bit operations by using a pair of general purpose registers.
When multiplying two regular 64-bit numbers, MUL/IMUL outputs its 128-bit result in the RDX:RAX register pair.
Inversely, DIV/IDIV takes its 128-bit dividend from the RDX:RAX pair and divides it by a 64-bit divisor, producing a 64-bit quotient and a 64-bit remainder.
Of course the CPU's ALU is 64 bits wide, so, as the Intel docs imply, handling the extra upper 64 bits costs extra micro-ops in the microcode. This is more dramatic for division (more than 3x as many), which already requires a lot of micro-ops.
Still, that means that under some circumstances (like using a rule of three to scale a value), it's possible for a compiler to emit regular CPU instructions and not bother doing any 128-bit emulation of its own (see the sketch after the list below).
This has been available for a long time:
since the 80386, 32-bit CPUs have been able to do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs have been able to do 32-bit multiplication/division using the DX:AX pair
(As for addition and subtraction: thanks to the carry flag, it's completely trivial to add or subtract numbers of any arbitrary length that fits in your storage.)
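To see the emulation concretely, these two functions (names are mine) show what GCC lowers __int128 to on x86-64: an add/adc pair for addition and a single widening multiply for a 64x64 product, with no vector registers involved:

__int128 add128(__int128 a, __int128 b)
{
    return a + b;             /* addq + adcq on the two 64-bit halves */
}

__int128 widening_mul(long long a, long long b)
{
    return (__int128)a * b;   /* one imulq producing the result in RDX:RAX */
}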
Why does ARM have only 16 registers? Is that the ideal number?
With more registers, does the larger register file also increase processing time/power?
As the number of the general-purpose registers becomes smaller, you need to start using the stack for variables. Using the stack requires more instructions, so code size increases. Using the stack also increases the number of memory accesses, which hurts both performance and power usage. The trade off is that to represent more registers you need more bits in your instruction, and you need more room on the chip for the register file, which increases power requirements. You can see how differing register counts affects code size and the frequency of load/store instructions by compiling the same set of code with different numbers of registers. The result of that type of exercise can be seen in table 1 of this paper:
Extendable Instruction Set Computing
Register Count   Program Size   Load/Store Frequency
      27             100.00          27.90%
      16             101.62          30.22%
       8             114.76          44.45%
(They used 27 as a base because that is the number of GPRs available on a MIPS processor)
As you can see, there are only marginal penalties in both program size and the number of loads/stores required as you drop the register count from 27 down to 16. The real penalties don't kick in until you drop down to 8 registers. I suspect the ARM designers felt that 16 registers was a kind of sweet spot when looking for the best performance per watt.
To choose one of 16 registers you need 4 bits, so it may simply be that this is the best match for the opcode encoding; otherwise you would have to introduce a more complex instruction set, which would lead to bigger code and therefore additional cost (execution time).
Wikipedia says it has a "Fixed instruction width of 32 bits to ease decoding and pipelining",
so 16 registers is a reasonable trade-off.
32-bit ARM has 16 registers because it only uses 4 bits to encode the register number, not because 16 is the ideal number. Likewise, x86 has only 8 registers because historically it used 3 bits to encode the register so that some instructions fit in a single byte.
Those are such limited numbers that both x86 and ARM, when going to 64-bit, doubled the count to 16 and 32 registers respectively. The old ARM instruction encoding had no spare bits left for larger register numbers, so a trade-off had to be made: the ability to execute almost every instruction conditionally was dropped, and the 4-bit condition field was reused for new features (that's an oversimplification; in reality the 64-bit encoding is entirely new, but you do need 3 more bits for the extra registers).
Back in the 80's (IIRC) an academic paper was published that examined a number of different workloads, comparing expected performance benefits of different numbers of registers. This was at a time when RISC processors were transitioning from academic ideas to mainstream hardware, and it was important to decide what was optimal. CPUs were already pulling ahead of memory in speed, and RISC was making this worse by limiting addressing modes and having separate load and store instructions. Having more registers meant you could "cache" more data for immediate access and therefore access main memory less.
Considering only powers of two, it was found that 32 registers was optimal, although 16 wasn't terribly far behind.
ARM is unusual in that almost every instruction can carry a condition code, avoiding separate tests and branches. Don't forget, many 32-register machines fix R0 to 0, so conditional tests are done by comparing against R0. I know from experience. 20 years ago I had to program a 'Mode 7' (in SNES terminology) floor renderer. The CPUs were the SH2 in the 32X (or rather two of them), the MIPS R3000 in the PlayStation and the ARM in the 3DO; the inner loops were 19, 15 and 11 instructions respectively. If the 3DO had been running at the same speed as the other two, it would have been twice as fast. As it was, it was just a bit slower.
I've used x86 SIMD instructions (SSE1234) in the form of intrinsics quite a lot lately. What I found frustrating is that the SSE ISA has several simple instructions that are available only for floats or only for integers, but which in theory should perform equally well for both. For example, both float and double vectors have instructions to load the upper 64 bits of a 128-bit vector from an address (movhps, movhpd), but there's no such instruction for integer vectors.
My question:
Is there any reason to expect a performance hit when using floating point instructions on integer vectors, e.g. using movhps to load data into an integer vector?
I wrote several tests to check that, but I suppose their results are not credible. It's really hard to write a correct test that explores all corner cases for such things, especially when the instruction scheduling is most probably involved here.
Related question:
Other trivially similar operations also have several instructions that do basically the same thing. For example I can do a bitwise OR with por, orps or orpd. Can anyone explain the purpose of these additional instructions? I guess this might be related to different scheduling algorithms applied to each instruction.
From an expert (obviously not me :P): http://www.agner.org/optimize/optimizing_assembly.pdf [13.2 Using vector instructions with other types of data than they are intended for (pages 118-119)]:
There is a penalty for using the wrong type of instructions on some processors. This is because the processor may have different data buses or different execution units for integer and floating point data. Moving data between the integer and floating point units can take one or more clock cycles depending on the processor, as listed in table 13.2.
Processor                      Bypass delay, clock cycles
Intel Core 2 and earlier       1
Intel Nehalem                  2
Intel Sandy Bridge and later   0-1
Intel Atom                     0
AMD                            2
VIA Nano                       2-3

Table 13.2. Data bypass delays between integer and floating point execution units
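For reference, the kind of mixing the question asks about looks like this with intrinsics (a sketch; the helper name is mine). The casts are purely for the type system and emit no instructions; whether a bypass delay applies depends on the microarchitecture, as in the table above:

#include <emmintrin.h>   /* SSE2: __m128i and the cast intrinsics */

/* Load the upper 64 bits of an "integer" vector with the float instruction MOVHPS. */
static inline __m128i loadh_epi64(__m128i v, const void *p)
{
    return _mm_castps_si128(
        _mm_loadh_pi(_mm_castsi128_ps(v), (const __m64 *)p));
}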
We are planning to port a big part of our digital signal processing routines from hardware-specific chips to a common desktop CPU architecture such as a quad-core. I am trying to estimate the limits of such an architecture for a program built with GCC. I am mostly interested in a high SDRAM-to-CPU bandwidth [GB/s] and in a high number of 32-bit IEEE-754 floating point multiply-accumulate operations per second.
I have selected a typical representative of the modern desktop CPUs - Quad Core, about 10Mb cache, 3GHz, 45nm. Can you please help me to find out its limits:
1) The highest possible number of multiply-accumulate operations per second, if we use whatever CPU-specific instructions GCC supports via input flags and all cores are used. The source code itself must not require changes if we decide to port it to a different CPU architecture like AltiVec on PowerPC; the best option is to use GCC flags like -msse or -maltivec. I also assume the program has to run 4 threads in order to utilize all available cores, right?
2) SDRAM-to-CPU bandwidth (the theoretical upper limit, independent of the mainboard).
UPDATE: Since GCC 3, GCC can generate SSE/SSE2 scalar code when the target supports those instructions. Automatic vectorization for SSE/SSE2 was added in GCC 4. SSE4.1 introduces the DPPS and DPPD instructions (dot products for array-of-structs data). The new 45nm Intel processors support SSE4 instructions.
First off, know that it will most likely not be possible for your code to both run as fast as possible on modern vector FPU units and be completely portable across architectures. It is possible to abstract away some aspects of the architectures via macros, etc, but compilers are (at present) capable of generating nearly optimal auto-vectorized code only for very simple programs.
Now, on to your questions: current x86 hardware does not have a multiply-accumulate, but is capable of one vector add and one vector multiply per cycle per core. Assuming that your code achieves full computational density, and you either hand-write vector code or your code is simple enough for the compiler to handle the task, the peak throughput that can be achieved independent of memory access latency is:
number of cores * cycles per second * flops per cycle * vector width
Which in your case sounds like:
4 * 3.2 GHz * 2 vector flops/cycle * 4 floats/vector = 102.4 Gflops
If you are going to write scalar code, divide that by four. If you are going to write vector code in C with some level of portable abstraction, plan to be leaving some performance on the table, but you can certainly go substantially faster than scalar code will allow. 50% of theoretical peak is a conservative guess (I would expect to do better assuming the algorithms are amenable to vectorization, but make sure you have some headroom in your estimates).
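As a sketch of the kind of loop this refers to (function name is mine), here is a plain multiply-accumulate kernel that GCC 4's auto-vectorizer can turn into packed multiply + add instructions with -O3 and -msse (or the AltiVec equivalent with -maltivec), without source changes:

#include <stddef.h>

void mac(float *restrict y, const float *restrict a,
         const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a[i] * b[i];  /* vectorizes to a multiply + add on 4 floats per iteration */
}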
edit: notes on DPPS:
DPPS is not a multiply-add, and using it as one is a performance hazard on current architectures. Looking it up in the Intel Optimization Manual, you will find that it has a latency of 11 cycles, and throughput is only one vector result every two cycles. DPPS does up to four multiplies and three adds, so you're getting 2 multiplies per cycle and 1.5 adds, whereas using MULPS and ADDPS would get you 4 of each every cycle.
More generally, horizontal vector operations should be avoided unless absolutely necessary; lay out your data so that your operations stay within vector lanes to the maximum extent possible.
In fairness to Intel, if you can't change your data layout, and DPPS happens to be exactly the operation that you need, then you want to use it. Just be aware that you're limiting yourself to less than 50% of peak FP throughput right off the bat by doing so.
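To make the layout point concrete, here are the two styles side by side (a sketch; the helper names are mine):

#include <smmintrin.h>   /* SSE4.1 */

/* Horizontal: one DPPS per 4-element dot product; convenient, but limited to
   one vector result every two cycles as described above. */
static inline float dot4_dpps(__m128 a, __m128 b)
{
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0xF1));
}

/* Vertical: keep independent partial sums in the vector lanes (MULPS + ADDPS
   each iteration) and reduce across the lanes only once at the end. */
static inline __m128 dot4_step(__m128 acc, __m128 a, __m128 b)
{
    return _mm_add_ps(acc, _mm_mul_ps(a, b));
}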
This may not directly answer your question, but have you considered using the PC's graphics cards for parallel floating-point computations? It's getting to the point where GPUs will outperform CPUs for some tasks; and the nice thing is that graphics cards are reasonably competitively priced.
I'm short on details, sorry; this is just to give you an idea.
Some points you should consider:
1) Intel's i7 architecture is at the moment your fastest option for 1 or 2 CPUs. Only at 4 or more sockets can AMD's Opterons compete.
2) Intel's compilers generate code that is often significantly faster than code generated by other compilers (though when used on AMD CPUs you have to patch away some of the CPU checks Intel puts in to prevent AMD from looking good).
3) No x86 CPU supports fused multiply-add yet; AMD's next architecture, "Bulldozer", will probably be the first to support it.
4) You get high memory bandwidth on any AMD CPU, but on Intel only with the new i7 architecture (socket 1366 is better than 775).
5) Use Intel's highly efficient libraries if possible.