How to swap the byte order for individual words in a vector in ARM/ACLE - arm

I usually write portable C code and try to adhere to a strictly standard-conforming subset of the features supported by compilers.
However, I'm now writing code that exploits the ARMv8 Cryptography Extensions to implement SHA-1 (and SHA-256 some days later). A problem I face is that FIPS-180 specifies the hash algorithms in big-endian byte order, whereas most ARM-based OS ABIs are little-endian.
If it were a single integer operand (in a general-purpose register) I could use the APIs specified for the next POSIX standard, but I'm working with SIMD registers, since that's where the ARMv8 Crypto instructions operate.
So my question: how do I swap the byte order of the words in a vector register on ARM? I'm fine with assembly answers, but would prefer ones using ACLE intrinsics.

The instructions are:
REV16 for byte-swapping short integers,
REV32 for byte-swapping 32-bit integers, and
REV64 for byte-swapping 64-bit integers.
They can also be used to swap the byte and word order at any granularity strictly smaller than what their name indicates. They're defined in sections C7.2.219 through C7.2.221 of the Arm Architecture Reference Manual, Armv8, for A-profile architecture ("DDI0487G_b_armv8_arm.pdf").
e.g. REV32 can be used to reverse the order of the two short integers within each 32-bit word:
[00][01][02][03][04][05][06][07]
to
[02][03][00][01][06][07][04][05]
Their intrinsics are defined in a separate document: Arm Neon Intrinsics Reference "advsimd-2021Q2.pdf"
To byte-swap each of the 32-bit words in a 128-bit vector, use the vrev32q_u8 intrinsic. The relevant vreinterpretq_* intrinsics need to be used to re-interpret the type of the operands.
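As a concrete illustration (a minimal sketch, not part of the original answer; the function name bswap32x4 is made up here), the combination looks like this in C:

#include <arm_neon.h>

/* Byte-swap each 32-bit word of a 128-bit vector. */
static inline uint32x4_t bswap32x4(uint32x4_t v)
{
    /* View the vector as 16 bytes, reverse the bytes within every
       32-bit container, then view it as four 32-bit lanes again. */
    uint8x16_t bytes = vreinterpretq_u8_u32(v);
    bytes = vrev32q_u8(bytes);
    return vreinterpretq_u32_u8(bytes);
}

On AArch64 this should compile down to a single REV32 Vd.16B, Vn.16B instruction; the vreinterpretq_* calls cost nothing at run time.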

Related

Will Knights Landing CPU (Xeon Phi) accelerate byte/word integer code?

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support "F" (like SSE without SSE2, or AVX without AVX2), so floating-point stuff mainly.
I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
I am confused whether there will be EVEX-encoded versions of all/most SSE4.1 instructions in AVX-512F, and whether this means I can expect my SSE code to automatically gain EVEX-extended instructions and map to all new registers.
Wikipedia says this:
The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.
This unfortunately does not clarify whether compiling SSE4 code with AVX-512 enabled will lead to the same (awesome) speedup that compiling it for AVX2 provides (VEX coding of legacy instructions).
Anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like with AVX1's VEX coding of the byte and word instructions?
Okay, I think I've pieced together enough information to make a decent answer. Here goes.
What will happen when native SSE2/4 code is run on Knights Landing (KNL)?
The code will run in the bottom fourth of the registers on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to a core and running in legacy mode.
What happens if the same code is recompiled with compiler flags for AVX-512F?
SSE2/4 code will be generated with the VEX prefix. That means pshufb becomes vpshufb and works with other AVX code in ymm. Instructions will NOT be promoted to AVX512's native EVEX or allowed to address the new zmm registers specifically. Instructions can only be promoted to EVEX with AVX512-VL, in which case they gain the ability to directly address (renamed) zmm registers. It is unknown whether register sharing is possible at this point, but pipelining on AVX2 has demonstrated similar throughput with half-width AVX2 (AVX-128) as with full 256-bit AVX2 code in many cases.
Most importantly, how do I get my SSE2/4/AVX128 byte/word size code running on AVX512F?
You'll have to load 128-bit chunks into xmm registers, sign- or zero-extend those bytes/words into 32-bit elements in zmm, and operate as if they had always been the larger integers. Then, when finished, convert back to bytes/words.
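As a hedged sketch of that widen-operate-narrow pattern (the intrinsics below are real SSE2/AVX-512F ones, but the add-one operation is only an illustration):

#include <immintrin.h>
#include <stdint.h>

/* Process 16 bytes at 32-bit element width in a zmm register, then narrow back. */
void add_one_to_16_bytes(uint8_t *dst, const uint8_t *src)
{
    __m128i b = _mm_loadu_si128((const __m128i *)src);    /* 16 bytes in xmm */
    __m512i w = _mm512_cvtepu8_epi32(b);                   /* zero-extend to 16 x 32-bit in zmm */
    w = _mm512_add_epi32(w, _mm512_set1_epi32(1));         /* operate at dword width */
    __m128i out = _mm512_cvtepi32_epi8(w);                 /* truncate back down to bytes */
    _mm_storeu_si128((__m128i *)dst, out);
}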
Is this fast?
According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction of presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (or between lanes for the 32-bit vpermd only, which is inapplicable to byte/word instructions). Combined with free type conversions, the ability to use masks implicitly with AVX512 (sparing the costly and register-intensive use of blendv or explicit mask generation), and the presence of more comparators (native NOT, unsigned/signed lt/gt, etc), it may provide a reasonable performance boost to rewrite SSE2/4 byte/word code for AVX512F after all. At least on KNL.
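For reference, a minimal sketch of such a 32-way 32-bit lookup with AVX-512F intrinsics (the function name is illustrative; the intrinsic compiles to vpermt2d/vpermi2d):

#include <immintrin.h>

/* 32-entry, 32-bit table lookup: each index selects one of the 32 dwords
   spread across tbl_lo and tbl_hi (bit 4 of the index picks the table). */
__m512i lookup32(__m512i idx, __m512i tbl_lo, __m512i tbl_hi)
{
    return _mm512_permutex2var_epi32(tbl_lo, idx, tbl_hi);
}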
Don't worry, I'll test the moment I get my hands on mine. ;-)

How can Microsoft say the size of a word in WinAPI is 16 bits?

I've just started learning the WinAPI. In the MSDN, the following explanation is provided for the WORD data type.
WORD
A 16-bit unsigned integer. The range is 0 through 65535 decimal.
This type is declared in WinDef.h as follows:
typedef unsigned short WORD;
Simple enough, and it matches the other resources I've been using for learning, but how can it be definitively said that it is 16 bits? The C data types page on Wikipedia specifies
short / short int / signed short / signed short int
Short signed integer type. Capable of containing at least the [−32767, +32767] range; thus, it is at least 16 bits in size.
So the size of a short could very well be 32 bits according to the C Standard. But who decides what bit sizes are going to be used anyway? I found a practical explanation here. Specifically, the line:
...it depends on both processors (more specifically, ISA, instruction set architecture, e.g., x86 and x86-64) and compilers including programming model.
So it's the ISA then, which makes sense I suppose. This is where I get lost. Taking a look at the Windows page on Wikipedia, I see this in the side bar:
Platforms
ARM, IA-32, Itanium, x86-64, DEC Alpha, MIPS, PowerPC
I don't really know what these are, but I think they're processors, each of which would have an ISA. Maybe Windows supports these platforms because all of them are guaranteed to use 16 bits for an unsigned short? This doesn't sound quite right, but I don't really know enough about this stuff to research any further.
Back to my question: How is it that the Windows API can typedef unsigned short WORD; and then say WORD is a 16-bit unsigned integer when the C Standard itself does not guarantee that a short is always 16 bits?
Simply put, a WORD is always 16 bits.
As a WORD is always 16 bits, but an unsigned short is not, a WORD is not always an unsigned short.
For every platform that the Windows SDK supports, the Windows header files contain #ifdef-style macros that detect the compiler and its platform, and associate the Windows-SDK-defined types (WORD, DWORD, etc.) with the appropriately sized platform types.
This is WHY the Windows SDK actually uses internally defined types, such as WORD, rather than using language types: so that they can ensure that their definitions are always correct.
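A purely hypothetical sketch of that idea (this is NOT the contents of the real WinDef.h, just an illustration of how a header can pin WORD to 16 bits per compiler):

#include <limits.h>

#if USHRT_MAX == 0xFFFF
typedef unsigned short WORD;   /* this compiler's unsigned short is 16-bit */
#else
#error "no 16-bit unsigned short on this compiler; WORD cannot be defined this way"
#endif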
The Windows SDK that ships with Microsoft toolchains is possibly lazy about this, as Microsoft's C++ toolchains always use a 16-bit unsigned short.
I would not expect the windows.h that ships with Visual Studio C++ to work correctly if dropped into GCC, Clang, etc., as so many details, including the mechanism of importing DLLs using the .lib files that the Platform SDK distributes, are Microsoft-specific implementations.
A different interpretation is that:
Microsoft says a WORD is 16 bits. If "someone" wants to call a windows API, they must pass a 16 bit value where the API defines the field as a WORD.
Microsoft also possibly says that, in order to build a valid Windows program using the Windows header files present in their Windows SDK, the user MUST choose a compiler that has a 16-bit short.
The C++ spec does not say that compilers must implement shorts as 16 bits - Microsoft says the compiler you choose to build Windows executables must.
There was originally an assumption that all code intended to run on Windows would be compiled with Microsoft's own compiler - or a fully compatible compiler. And that's the way it worked. Borland C: Matched Microsoft C. Zortech's C: Matched Microsoft C. gcc: not so much, so you didn't even try (not to mention there were no runtimes, etc.).
Over time this concept got codified and extended to other operating systems (or perhaps the other operating systems got it first) and now it is known as an ABI - Application Binary Interface - for a platform, and all compilers for that platform are assumed (in practice, required) to match the ABI. And that means matching expectations for the sizes of integral types (among other things).
An interesting related question you didn't ask is: So why is 16 bits called a word? Why is 32 bits a dword (double word) on our 32- and now 64-bit architectures, where the native machine "word" size is 32 or 64 bits, not 16? Because: 80286.
In the Windows headers there are a lot of #defines that, based on the platform, ensure a WORD is 16 bits, a DWORD is 32, etc. In some cases in the past, I know they distributed a separate SDK for each platform. In any case there's nothing magic, just a mixture of the proper #defines and headers.
The BYTE=8bits, WORD=16bits, and DWORD=32bits (double-word) terminology comes from Intel's instruction mnemonics and documentation for 8086. It's just terminology, and at this point doesn't imply anything about the size of the "machine word" on the actual machine running the code.
My guess:
Those C type names were probably originally introduced for the same reason that C99 standardized uint8_t, uint16_t, and uint32_t. The idea was probably to allow C implementations with an incompatible ABI (e.g. 16 vs. 32-bit int) to still compile code that uses the WinAPI, because the ABI uses DWORD rather than long or int in structs, and function args / return values.
Probably as Windows evolved, enough code started depending in various ways on the exact definition of WORD and DWORD that MS decided to standardize the exact typedefs. This diverges from the C99 uint16_t idea, where you can't assume that it's unsigned short.
As @supercat points out, this can matter for aliasing rules. e.g. if you modify an array of unsigned long[] through a DWORD*, it's guaranteed to work as expected. But if you modify an array of unsigned int[] through a DWORD*, the compiler might assume that didn't affect array values it already had in registers. This also matters for printf format strings. (C99's <stdint.h> solution to that is preprocessor macros like PRIu32.)
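A tiny sketch of the C99 route just mentioned (illustrative only):

#include <inttypes.h>
#include <stdio.h>

void print_u32(uint32_t x)
{
    /* PRIu32 expands to the right conversion specifier for uint32_t,
       whatever underlying type the implementation picked for it. */
    printf("value = %" PRIu32 "\n", x);
}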
Or maybe the idea was just to use names that match the asm, to make sure nobody was confused about the width of types. In the very early days of Windows, writing programs in asm directly, instead of C, was popular. WORD/DWORD makes the documentation clearer for people writing in asm.
Or maybe the idea was just to provide fixed-width types for portable code. e.g. #ifdef SUNOS: define it to an appropriate type for that platform. This is all it's good for at this point, as you noticed:
How is it that the Windows API can typedef unsigned short WORD; and then say WORD is a 16-bit unsigned integer when the C Standard itself does not guarantee that a short is always 16 bits?
You're correct, documenting the exact typedefs means that it's impossible to correctly implement the WinAPI headers in a system using a different ABI (e.g. one where long is 64bit or short is 32bit). This is part of the reason why the x86-64 Windows ABI makes long a 32bit type. The x86-64 System V ABI (Linux, OS X, etc.) makes long a 64bit type.
Every platform does need a standard ABI, though. struct layout, and even interpretation of function args, requires all code to agree on the size of the types used. Code from different versions of the same C compiler can interoperate, and even code from other compilers that follow the same ABI. (However, C++ ABIs aren't stable enough to standardize. For example, g++ has never standardized an ABI, and new versions do break ABI compatibility.)
Remember that the C standard only tells you what you can assume across every conforming C implementation. The C standard also says that signed integers might be sign/magnitude, one's complement, or two's complement. Any specific platform will use whatever representation the hardware does, though.
Platforms are free to standardize anything that the base C standard leaves undefined or implementation-defined. e.g. x86 C implementations allow unaligned pointers to exist, and even allow dereferencing them. This happens a lot with __m128i vector types.
The actual names chosen tie the WinAPI to its x86 heritage, and are unfortunately confusing to anyone not familiar with x86 asm, or at least Windows's 16bit DOS heritage.
Some 8086 instruction mnemonics include w for word and d for dword; for example, the sign-extension instructions commonly used as setup for idiv signed division:
cbw: sign extend AL (byte) into AX (word)
cwd: sign extend AX (word) into DX:AX (dword), i.e. copy the sign bit of ax into every bit of dx.
These insns still exist and do exactly the same thing in 32bit and 64bit mode. (386 and x86-64 added extended versions, as you can see in those extracts from Intel's insn set reference.) There's also lodsw, rep movsw, etc. string instructions.
Besides those mnemonics, operand-size needs to be explicitly specified in some cases, e.g.
mov dword ptr [mem], -1, where neither operand is a register that can imply the operand-size. (To see what assembly language looks like, just disassemble something. e.g. on a Linux system, objdump -Mintel -d /bin/ls | less.)
So the terminology is all over the place in x86 asm, which is something you need to be familiar with when developing an ABI.
More x86 asm background, history, and current naming schemes
Nothing below this point has anything to do with WinAPI or the original question, but I thought it was interesting.
See also the x86 tag wiki for links to Intel's official PDFs (and lots of other good stuff). This terminology is still ubiquitous in Intel and AMD documentation and instruction mnemonics, because it's completely unambiguous in a document for a specific architecture that uses it consistently.
386 extended register sizes to 32bits, and introduced the cdq instruction: cdq (eax (dword) -> edx:eax (qword)). (It also introduced movsx and movzx, to sign- or zero-extend without needing to get the data into eax first.) Anyway, quad-word is 64bits, and was used even pre-386 for double-precision memory operands for fld qword ptr [mem] / fst qword ptr [mem].
Intel still uses this b/w/d/q/dq convention for vector instruction naming, so it's not at all something they're trying to phase out.
e.g. the pshufd insn mnemonic (_mm_shuffle_epi32 C intrinsic) is Packed (integer) Shuffle Dword. psraw is Packed Shift Right Arithmetic Word. (FP vector insns use a ps (packed single) or pd (packed double) suffix instead of p prefix.)
As vectors get wider and wider, the naming starts to get silly: e.g. _mm_unpacklo_epi64 is the intrinsic for the punpcklqdq instruction: Packed-integer Unpack Low Quad-words to Double-Quad (i.e. interleave the 64bit low halves into one 128b vector). Or movdqu for Move Double-Quad Unaligned loads/stores (16 bytes). Some assemblers use o (oct-word) for declaring 16 byte integer constants, but Intel mnemonics and documentation always use dq.
Fortunately for our sanity, the AVX 256b (32B) instructions still use the SSE mnemonics, so vmovdqu ymm0, [rsi] is a 32B load, but there's no quad-quad terminology. Disassemblers that include operand-sizes even when it's not ambiguous would print vmovdqu ymm0, ymmword ptr [rsi].
Even the names of some AVX-512 extensions use the b/w/d/q terminology. AVX-512F (foundation) doesn't include all element-size versions of every instruction. The 8bit and 16bit element size versions of some instructions are only available on hardware that supports the AVX-512BW extension. There's also AVX-512DQ for extra dword and qword element-size instructions, including conversion between float/double and 64bit integers and a multiply with 64b x 64b => 64b element size.
A few new instructions use numeric sizes in the mnemonic
AVX's vinsertf128 (and its counterpart vextractf128) for inserting/extracting the high 128bit lane of a 256bit vector could have used dq, but instead use 128.
AVX-512 introduces a few insn mnemonics with names like vmovdqa64 (vector load with masking at 64bit element granularity) or vshuff32x4 (shuffle 128b elements, with masking at 32bit element granularity).
Note that since AVX-512 has merge-masking or zero-masking for almost all instructions, even instructions that didn't used to care about element size (like pxor / _mm_xor_si128) now come in different sizes: _mm512_mask_xor_epi64 (vpxorq) (each mask bit affects a 64bit element), or _mm512_mask_xor_epi32 (vpxord). The no-mask intrinsic _mm512_xor_si512 could compile to either vpxorq or vpxord; it doesn't matter.
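A small sketch of merge-masking with those intrinsics (the mask value here is arbitrary, chosen only for illustration):

#include <immintrin.h>

/* XOR only the low 8 dword elements of a and b; the other 8 elements of
   the result are taken unchanged from src (merge-masking). */
__m512i xor_low_half(__m512i src, __m512i a, __m512i b)
{
    __mmask16 k = 0x00FF;
    return _mm512_mask_xor_epi32(src, k, a, b);   /* vpxord zmm{k} */
}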
Most AVX512 new instructions still use b/w/d/q in their mnemonics, though, like VPERMT2D (full permute selecting elements from two source vectors).
Currently there are no platforms that support the Windows API but where unsigned short is not 16 bits.
If someone ever did make such a platform, the Windows API headers for that platform would not include the line typedef unsigned short WORD;.
You can think of MSDN pages as describing typical behaviour for MSVC++ on x86/x64 platforms.
The legacy of types like WORD predates Windows, going back to the days of MS-DOS, and follows the types defined by MASM (whose name was later changed to ML). MASM's signed types, such as SBYTE, SWORD, SDWORD, and SQWORD, were not adopted by the Windows API.
QWORD / SQWORD in MASM probably wasn't defined until MASM / ML supported 80386.
A current reference:
http://msdn.microsoft.com/en-us/library/8t163bt0.aspx
Windows added types such as HANDLE, WCHAR, TCHAR, ... .
For Windows / Microsoft compilers, size_t is an unsigned integer the same size as a pointer: 32 bits in 32-bit mode, 64 bits in 64-bit mode.
The DB and DW data directives in MASM go back to the days of Intel's 8080 assembler.

Is __int128_t arithmetic emulated by GCC, even with SSE?

I've heard that the 128-bit integer data-types like __int128_t provided by GCC are emulated and therefore slow. However, I understand that the various SSE instruction sets (SSE, SSE2, ..., AVX) introduced at least some instructions for 128-bit registers. I don't know very much about SSE or assembly / machine code, so I was wondering if someone could explain to me whether arithmetic with __int128_t is emulated or not using modern versions of GCC.
The reason I'm asking this is because I'm wondering if it makes sense to expect big differences in __int128_t performance between different versions of GCC, depending on what SSE instructions are taken advantage of.
So, what parts of __int128_t arithmetic are emulated by GCC, and what parts are implemented with SSE instructions (if any)?
I was confusing two different things in my question.
Firstly, as PaulR explained in the comments: "There are no 128 bit arithmetic operations in SSE or AVX (apart from bitwise operations)". Considering this, 128-bit arithmetic has to be emulated on modern x86-64 based processors (e.g. AMD Family 10 or Intel Core architecture). This has nothing to do with GCC.
The second part of the question is whether or not 128-bit arithmetic emulation in GCC benefits from SSE/AVX instructions or registers. As implied in PaulR's comments, there isn't much in SSE/AVX that's going to allow you to do 128-bit arithmetic more easily; most likely x86-64 instructions will be used for this. The code I'm interested in can't compile with -mno-sse, but it compiles fine with -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -mno-avx -mno-avx2 and performance isn't affected. So my code doesn't benefit from modern SSE instructions.
SSE2-AVX instructions are available for 8-, 16-, 32-, and 64-bit integer data types. They are mostly intended to operate on packed data: for example, a 128-bit register may contain four 32-bit integers, and so on.
Although SSE/AVX/AVX-512/etc. have no 128-bit arithmetic mode (their vector elements are strictly 64-bit max, and operations simply wrap within each element), as Paul R has implied, the main CPU does support limited 128-bit operations by using a pair of registers.
When multiplying two regular 64-bit numbers, MUL/IMUL can output its 128-bit result in the RDX:RAX register pair.
Conversely, when dividing, DIV/IDIV can take its input from the RDX:RAX pair to divide a 128-bit number by a 64-bit divisor (and outputs a 64-bit quotient plus a 64-bit remainder).
Of course the CPU's ALU is 64-bit; thus - as the Intel docs imply - the extra upper 64 bits come at the cost of extra micro-ops in the microcode. This is more dramatic for division (more than 3x more), which already requires lots of micro-ops to be processed.
Still, that means that under some circumstances (like using a rule of three to scale a value), it's possible for the compiler to emit regular CPU instructions and not bother doing any 128-bit emulation by itself.
This has been available for a long time:
since the 80386, 32-bit CPUs could do 64-bit multiplication/division using the EDX:EAX pair
since the 8086/88, 16-bit CPUs could do 32-bit multiplication/division using the DX:AX pair
(As for addition and subtraction: thanks to carry support, it's completely trivial to do additions/subtractions of numbers of any arbitrary length that fits in your storage.)
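A small sketch of what this looks like from C with GCC (compile with -O2 and inspect the generated assembly to confirm on your own target; no SSE is involved):

#include <stdint.h>

/* Typically compiles to an add + adc pair of 64-bit instructions. */
unsigned __int128 add128(unsigned __int128 a, unsigned __int128 b)
{
    return a + b;
}

/* A full 64x64 -> 128-bit product: a single MUL writing the RDX:RAX pair. */
unsigned __int128 mul64(uint64_t a, uint64_t b)
{
    return (unsigned __int128)a * b;
}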

Is a 32 bit integer on a 16 bit machine possible?

Just wondering if it is possible. If yes, are there other ways besides a compiler emulation layer?
Thanks
It's processor-dependent. Some processors have special instructions to manipulate register pairs (e.g. the 8-bit AVR instruction set has instructions for 16-bit register pairs). On processors without such native support, the compiler usually emits instructions that work with pairs of registers at a time (this is what is usually done to support 64-bit numbers on 32-bit processors, for example).
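A rough sketch in C of what that pair-wise emulation amounts to (the types and names here are illustrative, not taken from any real compiler runtime):

#include <stdint.h>

/* A 32-bit value represented as two 16-bit halves. */
typedef struct { uint16_t lo, hi; } u32pair;

u32pair add32(u32pair a, u32pair b)
{
    u32pair r;
    r.lo = (uint16_t)(a.lo + b.lo);           /* add the low halves (wraps mod 2^16) */
    uint16_t carry = (r.lo < a.lo);           /* carry out of the low half */
    r.hi = (uint16_t)(a.hi + b.hi + carry);   /* add the high halves plus the carry */
    return r;
}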
Yes, it is possible. Look at the Z80 from the 70s as an example of an 8-bit processor that can manipulate 16-bit values.
Make sure you know what "16-bit processor" means because I have found a lot of people have a misconception about it. Does it mean the opcode size, because some processors have variable width operations? Does it mean the addressing size? Does it mean the smallest/largest value it can natively manipulate?
And as far as at compile-time, sure. Check out arbitrary large number libraries (aka "big nums").

How to treat 64-bit words on a CUDA device?

I'd like to handle directly 64-bit words on the CUDA platform (eg. uint64_t vars).
I understand, however, that addressing space, registers and the SP architecture are all 32-bit based.
I actually found this to work correctly (on my CUDA cc1.1 card):
__global__ void test64Kernel( uint64_t *word )
{
    (*word) <<= 56;
}
but I don't know, for example, how this affects registers usage and the operations per clock cycle count.
Whether addresses are 32-bit or anything else does not affect what data types you can use. In your example you have a pointer (32-bit, 64-bit, 3-bit (!) - doesn't matter) to a 64-bit unsigned integer.
64-bit integers are supported in CUDA, but of course for every 64-bit value you are storing twice as much data as for a 32-bit value, so you will use more registers, and arithmetic operations will take longer (adding two 64-bit integers is simply expanded into operations on the smaller data types, using carries to propagate into the next sub-word). The compiler is an optimising compiler, so it will try to minimise the impact of this.
Note that using double precision floating point, also 64-bit, is only supported in devices with compute capability 1.3 or higher (i.e. 1.3 or 2.0 at this time).

Resources