Embedded broadcasts with intrinsics and assembly

In section 2.5.3 "Broadcasts" of the Intel Architecture Instruction Set Extensions Programming Reference we learn that
AVX512 (and Knights Corner) has
a bit-field to encode a data broadcast for some load-op instructions, i.e. instructions that
load data from memory and perform some computational
or data movement operation.
For example, using Intel assembly syntax, we can broadcast the scalar at the address stored in rax, multiply it with the 16 floats in zmm2, and write the result to zmm1 like this:
vmulps zmm1, zmm2, [rax] {1to16}
However, there are no intrinsics which express this directly. With intrinsics, then, the compiler should be able to fold
__m512 bb = _mm512_set1_ps(b);
__m512 ab = _mm512_mul_ps(a,bb);
to a single instruction
vmulps zmm1, zmm2, [rax] {1to16}
but I have not observed GCC doing this. I found a GCC bug report about this.
I have observed something similar with FMA in GCC: e.g. GCC 4.9 will not contract _mm256_add_ps(_mm256_mul_ps(areg0,breg0), creg0) to a single FMA instruction even with -Ofast, though GCC 5.1 now does. At least there are intrinsics to do this explicitly with FMA, e.g. _mm256_fmadd_ps, but there is no e.g. _mm512_mulbroad_ps(vector,scalar) intrinsic.
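For reference, a minimal sketch of the FMA intrinsic route (the names are illustrative; compile with -mfma):
#include <immintrin.h>
__m256 fma_example(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);   /* a*b + c, contracted to one vfmadd instruction */
}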
GCC may fix this at some point but until then assembly is the only solution.
So my question is: how can this be done with inline assembly in GCC?
I think I may have come up with the correct syntax (but I am not sure) for GCC inline assembly for the example above:
"vmulps (%%rax)%{1to16}, %%zmm1, %%zmm2\n\t"
I am really looking for a function like this:
static inline __m512 mul_broad(__m512 a, float b) {
    return a*b;
}
where, if b is in memory pointed to by rax, it produces
vmulps (%rax){1to16}, %zmm0, %zmm0
ret
and if b is in xmm1 it produces
vbroadcastss %xmm1, %zmm1
vmulps %zmm1, %zmm0, %zmm0
ret
GCC will already do the vbroadcastss-from-register case with intrinsics; but if b is in memory, it compiles the following to a separate vbroadcastss load from memory rather than folding the broadcast into the vmulps:
__m512 mul_broad(__m512 a, float b) {
    __m512 bb = _mm512_set1_ps(b);
    __m512 ab = _mm512_mul_ps(a,bb);
    return ab;
}
clang will use a broadcast memory operand if b is in memory.

As Peter Cordes notes, GCC doesn't let you specify a different template for different constraint alternatives. So instead, my solution has the assembler choose the correct instruction according to the operands chosen.
I don't have a version of GCC that supports the ZMM registers, so the following example uses XMM registers and a couple of nonexistent instructions to demonstrate how you can achieve what you're looking for.
typedef __attribute__((vector_size(16))) float v4sf;

v4sf foo(v4sf a, float b) {
    v4sf ret;
    asm(".ifndef isxmm\n\t"
        ".altmacro\n\t"
        ".macro ifxmm operand, rnum\n\t"
        ".ifc \"\\operand\",\"%%xmm\\rnum\"\n\t"
        ".set isxmm, 1\n\t"
        ".endif\n\t"
        ".endm\n\t"
        ".endif\n\t"
        ".set isxmm, 0\n\t"
        ".set regnum, 0\n\t"
        ".rept 8\n\t"
        "ifxmm <%2>, %%regnum\n\t"
        ".set regnum, regnum + 1\n\t"
        ".endr\n\t"
        ".if isxmm\n\t"
        "alt-1 %1, %2, %0\n\t"
        ".else\n\t"
        "alt-2 %1, %2, %0\n\t"
        ".endif\n\t"
        : "=x,x" (ret)
        : "x,x" (a), "x,m" (b));
    return ret;
}

v4sf bar(v4sf a, v4sf b) {
    return foo(a, b[0]);
}
v4sf
bar(v4sf a, v4sf b) {
return foo(a, b[0]);
}
This example should be compiled with gcc -m32 -msse -O3 and should generate two assembler error messages similar to the following:
t103.c: Assembler messages:
t103.c:24: Error: no such instruction: `alt-2 %xmm0,4(%esp),%xmm0'
t103.c:22: Error: no such instruction: `alt-1 %xmm0,%xmm1,%xmm0'
The basic idea here is that the assembler checks whether the second operand (%2) is an XMM register or something else, presumably a memory location. Since the GNU assembler doesn't support much in the way of operations on strings, the second operand is compared to every possible XMM register, one at a time, in a .rept loop. The ifxmm macro is used to paste %xmm and a register number together.
For your specific problem you'd probably need to rewrite it something like this:
__m512 mul_broad(__m512 a, float b) {
    __m512 ret;
    __m512 dummy;
    asm(".ifndef isxmm\n\t"
        ".altmacro\n\t"
        ".macro ifxmm operand, rnum\n\t"
        ".ifc \"\\operand\",\"%%zmm\\rnum\"\n\t"
        ".set isxmm, 1\n\t"
        ".endif\n\t"
        ".endm\n\t"
        ".endif\n\t"
        ".set isxmm, 0\n\t"
        ".set regnum, 0\n\t"
        ".rept 32\n\t"
        "ifxmm <%[b]>, %%regnum\n\t"
        ".set regnum, regnum + 1\n\t"
        ".endr\n\t"
        ".if isxmm\n\t"
        "vbroadcastss %x[b], %[b]\n\t"
        "vmulps %[a], %[b], %[ret]\n\t"
        ".else\n\t"
        "vmulps %[b] %{1to16%}, %[a], %[ret]\n\t"
        "# dummy = %[dummy]\n\t"
        ".endif\n\t"
        : [ret] "=x,x" (ret), [dummy] "=xm,x" (dummy)
        : [a] "x,xm" (a), [b] "m,[dummy]" (b));
    return ret;
}
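For reference, a couple of hypothetical callers (the names are mine, not from the answer) that should exercise the two constraint alternatives:
__m512 from_reg(__m512 a, float b) { return mul_broad(a, b); }          /* b arrives in a register */
__m512 from_mem(__m512 a, const float *p) { return mul_broad(a, *p); }  /* b comes from memory */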

Related

GNU inline asm: same register for different output operands allowed?

I've written a small function in C that contains a short inline assembly statement.
Inside the inline assembly statement I need 2 "temporary" registers to load and compare some memory values.
To allow the compiler to choose "optimal temporary registers" I would like to avoid hard-coding those temp registers (and putting them into the clobber list).
Instead I decided to create 2 local variables in the surrounding C-function just for this purpose. I used "=r" to add these local variables to the output operands specification of the inline asm statement and then used them for my load/compare purposes.
These local variables are not used elsewhere in the C-function and (maybe because of this fact) the compiler decided to assign the same register to the two related output operands which makes my code unusable (comparison is always true).
Is the compiler allowed to use overlapping registers for different output operands or is this a compiler bug (I tend to rate this as a bug)?
I only found information regarding early clobbers which prevent overlapping of register for inputs and outputs... but no statement for just output operands.
A workaround is to initialize my temporary variables and to use "+r" instead of "=r" for them in the output operand specification. But in this case the compiler emits initialization instructions which I would like to avoid.
Is there any clean way to let the compiler choose optimal registers that do not overlap each other just for "internal inline assembly usage"?
Thank you very much!
P.S.: I code for some "exotic" target using a "non-GNU" compiler that supports "GNU inline assembly".
P.P.S.: I also don't understand in the example below why the compiler doesn't generate code for "int eq=0;" (e.g. 'mov d2, 0'). Maybe I totally misunderstood the "=" constraint modifier?
Totally useless and stupid example below just to illustrate (focus on) the problem:
int foo(const int *s1, const int *s2)
{
    int eq = 0;
#ifdef WORKAROUND
    int t1=0, t2=1;
#else
    int t1, t2;
#endif
    __asm__ volatile(
        "ld.w %[t1], [%[s1]] \n\t"
        "ld.w %[t2], [%[s2]] \n\t"
        "jne %[t1], %[t2], 1f \n\t"
        "mov %[eq], 1 \n\t"
        "1:"
        : [eq] "=d" (eq),
          [s1] "+a" (s1), [s2] "+a" (s2),
#ifdef WORKAROUND
          [t1] "+d" (t1), [t2] "+d" (t2)
#else
          [t1] "=d" (t1), [t2] "=d" (t2)
#endif
        );
    return eq;
}
In the created asm the compiler used register 'd8' for both operands 't1' and 't2':
foo:
; 'mov d2, 0' is missing
ld.w d8, [a4] ; 'd8' allocated for 't1'
ld.w d8, [a5] ; 'd8' allocated for 't2' too!
jne d8, d8, 1f
mov d2, 1
1:
ret16
Compiling w/ '-DWORKAROUND':
foo:
; 'mov d2, 0' is missing
mov16 d9,1
mov16 d8,0
ld.w d9, [a5]
jne d8, d9, 1f
mov d2, 1
1:
ret16
EABI for this machine:
return register (non-pointer/pointer): d2, a2
non-pointer args: d4..d7
pointer args: a4..a7
I think this is a bug in your compiler.
If it says it supports "GNU inline assembly" then one would expect it to follow GCC, whose manual is the closest thing there is to a formal specification. Now the GCC manual doesn't seem to explicitly say "output operands will not share registers with each other", but as o11c mentions, they do suggest using output operands for scratch registers, and that wouldn't work if they could share registers.
A workaround that might be more efficient than yours would be to follow your inline asm with a second dummy asm statement that "uses" both the outputs. Hopefully this will convince the compiler that they are potentially different values and therefore need separate registers:
int t1, t2;
__asm__ volatile(" ... code ..."
                 : [t1] "=d" (t1), [t2] "=d" (t2) : ...);
__asm__ volatile(""   // no code
                 : : "r" (t1), "r" (t2));
With luck this will avoid any extra code being generated for unnecessary initialization, etc.
Another possibility would be to hardcode specific scratch registers and declare them as clobbered. It leaves less flexibility for the register allocator, but depending on the surrounding code and how smart the compiler is, it may not make a lot of difference.
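For illustration, a sketch of that hardcoded variant for the example above (d8/d9 are placeholders; whether they are safe to clobber depends on this target's ABI):
int foo_clobber(const int *s1, const int *s2)
{
    int eq = 0;
    __asm__ volatile(
        "ld.w d8, [%[s1]] \n\t"
        "ld.w d9, [%[s2]] \n\t"
        "jne d8, d9, 1f \n\t"
        "mov %[eq], 1 \n\t"
        "1:"
        : [eq] "+d" (eq)                       /* read/write: conditionally updated */
        : [s1] "a" (s1), [s2] "a" (s2)
        : "d8", "d9", "memory");               /* scratch registers declared clobbered */
    return eq;
}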

How to get bits of specific xmm registers?

So I want to get the value or state of specific xmm registers. This is primarily for a crash log or just to see the state of the registers for debugging. I tried this, but it doesn't seem to work:
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    register __m128i my_val __asm__("xmm0");
    __asm__ ("" :"=r"(my_val));
    printf("%llu %llu\n", my_val & 0xFFFFFFFFFFFFFFFF, my_val << 63);
    return 0;
}
As far as I know, the store related intrinsics would not treat the __m128i as a POD data type but rather as a reference to one of the xmm registers.
How do I get and access the bits stored in the __m128i as 64 bit integers? Or does my __asm__ above work?
How do I get and access the bits stored in the __m128i as 64 bit integers?
You will have to convert the __m128i vector to a pair of uint64_t variables. You can do that with conversion intrinsics:
uint64_t lo = _mm_cvtsi128_si64(my_val);
uint64_t hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(my_val, my_val));
...or through memory:
uint64_t buf[2];
_mm_storeu_si128((__m128i*)buf, my_val);
uint64_t lo = buf[0];
uint64_t hi = buf[1];
The latter may be worse in terms of performance, but if you intend to use it only for debugging, it would do. It is also trivial to adapt to differently sized elements, if you need that.
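For example, to print the two halves as a single 128-bit hex value (a minimal usage sketch):
printf("%016llx%016llx\n", (unsigned long long)hi, (unsigned long long)lo);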
Or does my __asm__ above work?
No, it doesn't. The "=r" output constraint does not allow vector registers such as xmm0, which you pass as an output; it only allows general-purpose registers. No general-purpose register is 128 bits wide, so that asm statement makes no sense.
Also, I should note that my_val << 63 shifts the value in the wrong way. If you wanted to output the high half of the hypothetical 128-bit value then you should've shifted right, not left. And besides that, shifts on vectors are either not implemented or act on each element of the vector rather than the vector as a whole, depending on the compiler. But this part is moot, as with the code above you don't need any shifts to output the two halves.
If you really want to know about register values, rather than __m128i C variable values, I'd suggest using a debugger like GDB. print /x $xmm0.v2_int64 when stopped at a breakpoint.
Capturing a register at the top of a function is a pretty flaky and unreliable thing to attempt (it smells like you've already gone down the wrong design path)1. But you're on the right track with a register-asm local var. However, xmm0 can't match an "=r" constraint, only "=x". See Reading a register value into a C variable for more about using an empty asm template to tell the compiler you want a C variable to be what was in a register.
You do need the asm volatile("" : "=x"(var)); statement, though; GNU C register-asm local vars have no guarantees whatsoever except when used as operands to asm statements. (GCC will often keep your var in that register anyway, but IIRC clang won't.)
There's not a lot of guarantee about where this will be ordered wrt. other code (asm volatile may help some, or for stronger ordering also use a "memory" clobber). Also no guarantee that GCC won't use the register for something else first. (Especially a call-clobbered register like any xmm reg.) But it does at least happen to work in the version I tested.
print a __m128i variable shows how to print a __m128i as two 64-bit halves once you have it, or as other element sizes. The compiler will often optimize _mm_store_si128 / reload into shuffles, and this is for printing anyway so keep it simple.
Using a unsigned __int128 tmp; would also be an option in GNU C on x86-64.
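A hedged sketch of that __int128 variant (xmm_val stands for whatever __m128i variable you captured; memcpy sidesteps aliasing questions):
unsigned __int128 tmp;
memcpy(&tmp, &xmm_val, sizeof(tmp));   /* needs <string.h> */
printf("%#llx %#llx\n", (unsigned long long)(tmp >> 64), (unsigned long long)tmp);
The complete register-asm version: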
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif

// If you need this, you're probably doing something wrong.
// There's no guarantee about what a compiler will have in XMM0 at any point
void foo() {
    register __m128i xmm0 __asm__("xmm0");
    __asm__ volatile ("" :"=x"(xmm0));
    alignas(16) uint64_t buf[2];
    _mm_store_si128((__m128i*)buf, xmm0);
    printf("%llu %llu\n", buf[1], buf[0]);  // I'd normally use hex, like %#llx
}
This prints the high half first (most significant), so reading left to right across both elements we get each byte in descending order of memory address within buf.
It compiles to the asm we want with both GCC and clang (Godbolt), not stepping on xmm0 before reading it.
# GCC10.2 -O3
foo:
movhlps xmm1, xmm0
movq rdx, xmm0 # low half -> RDX
mov edi, OFFSET FLAT:.LC0
xor eax, eax
movq rsi, xmm1 # high half -> RSI
jmp printf
Footnote 1:
If you make sure your function doesn't inline, you could take advantage of the calling convention to get the incoming values of xmm0..7 (for x86-64 System V), or xmm0..3 if you have no integer args (Windows x64).
__attribute__((noinline))
void foo(__m128i xmm0, __m128i xmm1, __m128i xmm2, etc.) {
// do whatever you want with the xmm0..7 args
}
If you want to provide a different prototype for the function for callers to use (which omits the __m128i args), that can maybe work. It's of course Undefined Behaviour in ISO C, but if you truly stop inlining, the effects depend on the calling convention. As long as you make sure it's noinline so link-time optimization doesn't do cross-file inlining.
Of course, the mere fact of inserting a function call will change register allocation in the caller, so this only helps for a function you were going to call anyway.
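A sketch of that two-prototype idea (hypothetical names; this is UB in ISO C and only behaves as described under the x86-64 System V convention with inlining truly disabled):
/* prototype the callers see: */
void dump_xmm_args(void);
/* definition in a separate translation unit, built without LTO: */
__attribute__((noinline))
void dump_xmm_args(__m128i xmm0, __m128i xmm1, __m128i xmm2, __m128i xmm3) {
    /* inspect or print the incoming register values here */
}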

GNU C inline asm input constraint for AVX512 mask registers (k1...k7)?

AVX512 introduced an opmask feature for its arithmetic instructions. A simple example: godbolt.org.
#include <immintrin.h>

__m512i add(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "mov ebx, 0xAAAAAAAA; \n\t"
        "kmovw k1, ebx; \n\t"
        "vpaddd %[SUM] %{k1%}%{z%}, %[A], %[B]; # conditional add "
        : [SUM] "=v"(sum)
        : [A] "v" (a),
          [B] "v" (b)
        : "ebx", "k1" // clobbers
    );
    return sum;
}
Compiled with -march=skylake-avx512 -masm=intel -O3, this produces:
mov ebx,0xaaaaaaaa
kmovw k1,ebx
vpaddd zmm0{k1}{z},zmm0,zmm1
The problem is that k1 has to be specified.
Is there an input constraint like "r" for integers except that it picks a k register instead of a general-purpose register?
__mmask16 is literally a typedef for unsigned short (and other mask types for other plain integer types), so we just need a constraint for passing it in a k register.
We have to go digging in the gcc sources config/i386/constraints.md to find it:
The constraint for any mask register is "k". Or use "Yk" for k1..k7 (which can be used as a predicate, unlike k0). You'd use an "=k" operand as the destination for a compare-into-mask, for example.
Obviously you can use "=Yk"(tmp) with a __mmask16 tmp to get the compiler to do register allocation for you, instead of just declaring clobbers on whichever "k" registers you decide to use.
Prefer intrinsics like _mm512_maskz_add_epi32
First of all, https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it. Understanding asm is great, but use that to read compiler output and/or figure out what would be optimal, then write intrinsics that can compile the way you want. Performance tuning info like https://agner.org/optimize/ and https://uops.info/ list things by asm mnemonic, and they're shorter / easier to remember than intrinsics, but you can search by mnemonic to find intrinsics on https://software.intel.com/sites/landingpage/IntrinsicsGuide/
Intrinsics will also let the compiler fold loads into memory source operands for other instructions; with AVX512 those can even be broadcast loads! Your inline asm forces the compiler to use a separate load instruction. Even a "vm" input won't let the compiler pick a broadcast-load as the memory source, because it wouldn't know the broadcast element width of the instruction(s) you were using it with.
Use _mm512_mask_add_epi32 or _mm512_maskz_add_epi32 especially if you're already using __m512i types from <immintrin.h>.
Also, your asm as originally posted had a bug: it used {k1} merge-masking, not {k1}{z} zero-masking, but used the uninitialized __m512i sum; with an output-only "=v" constraint as the merge destination! As a stand-alone function, it happens to merge into a because the calling convention has ZMM0 = first input = return value register. But when inlining into other functions, you definitely can't assume that sum will pick the same register as a. Your best bet is to use a read/write operand "+v"(a) and use it as the destination and first source.
Merge-masking only makes sense with a "+v" read/write operand. (Or in an asm statement with multiple instructions where you've already written an output once, and want to merge another result into it.)
Intrinsics would stop you from making this mistake; the merge-masking version has an extra input for the merge-target. (The asm destination operand).
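For example, minimal sketches of both masked forms with intrinsics (the constant and function names are illustrative):
#include <immintrin.h>
__m512i add_z(__m512i a, __m512i b) {              /* lanes zeroed where mask bit = 0 */
    return _mm512_maskz_add_epi32((__mmask16)0xAAAA, a, b);
}
__m512i add_m(__m512i src, __m512i a, __m512i b) { /* lanes keep src where mask bit = 0 */
    return _mm512_mask_add_epi32(src, (__mmask16)0xAAAA, a, b);
}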
Example using "Yk"
// works with -march=skylake-avx512 or -march=knl
// or just -mavx512f but don't do that.
// also needed: -masm=intel
#include <immintrin.h>

__m512i add_zmask(__m512i a, __m512i b) {
    __m512i sum;
    asm(
        "vpaddd %[SUM] %{%[mask]%}%{z%}, %[A], %[B]; # conditional add "
        : [SUM] "=v"(sum)
        : [A] "v" (a),
          [B] "v" (b),
          [mask] "Yk" ((__mmask16)0xAAAA)
        // no clobbers needed, unlike your question which I fixed with an edit
    );
    return sum;
}
Note that all the { and } are escaped with % (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Special-format-strings), so they're not parsed as dialect-alternatives {AT&T | Intel-syntax}.
This compiles with gcc as early as 4.9, but don't actually do that because it doesn't understand -march=skylake-avx512, or even have tuning settings for Skylake or KNL. Use a more recent GCC that knows about your CPU for best results.
Godbolt compiler explorer:
# gcc8.3 -O3 -march=skylake-avx512 or -march=knl (and -masm=intel)
add(long long __vector, long long __vector):
mov eax, -21846
kmovw k1, eax # compiler-generated
# inline asm starts
vpaddd zmm0 {k1}{z}, zmm0, zmm1; # conditional add
# inline asm ends
ret
-mavx512bw (implied by -march=skylake-avx512 but not knl) is required for "Yk" to work on an int. If you're compiling with -march=knl, integer literals need a cast to __mmask16 or __mmask8, because unsigned int = __mmask32 isn't available for masks.
[mask] "Yk" (0xAAAA) requires AVX512BW even though the constant does fit in 16 bits, just because bare integer literals always have type int. (vpaddd zmm has 16 elements per vector, so I shortened your constant to 16-bit.) With AVX512BW, you can pass wider constants or leave out the cast for narrow ones.
gcc6 and later support -march=skylake-avx512. Use that to set tuning as well as enabling everything. Preferably gcc8 or at least gcc7. Newer compilers generate less clunky code with new ISA extensions like AVX512 if you're ever using it outside of inline asm.
gcc5 supports -mavx512f -mavx512bw but doesn't know about Skylake.
gcc4.9 doesn't support -mavx512bw.
"Yk" is unfortunately not yet documented in https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html.
I knew where to look in the GCC source thanks to Ross's answer on In GNU C inline asm, what are the size-override modifiers for xmm/ymm/zmm for a single operand?
While it is undocumented, looking here we see:
(define_register_constraint "Yk" "TARGET_AVX512F ? MASK_REGS : NO_REGS"
  "#internal Any mask register that can be used as predicate, i.e. k1-k7.")
Editing your godbolt to this:
asm(
    "vpaddd %[SUM] %{%[k]%}, %[A], %[B]"
    : [SUM] "=v"(sum)
    : [A] "v" (a), [B] "v" (b), [k] "Yk" (0xaaaaaaaa) );
seems to produce the correct output.
That said, I usually try to discourage people from using inline asm (and undocumented features). Can you use _mm512_mask_add_epi32?

errors when pushing xmm registers onto stack

I'm trying to push an xmm register onto the stack in x86_64 C code using GCC-style inline assembly. I looked at the answer to this question and am using this code
int main(void) {
    asm volatile("subq 16, %rsp");
    asm volatile("movdqu xmm0, xmmword ptr (%rsp)");
}
and when I compile it on OS X 10.10.2 with clang 6.0, I get the error error: unexpected token in argument list, and a green arrow pointing to the ptr in the second asm line.
I change the code to
int main(void) {
    asm volatile("subq 16, %rsp");
    asm volatile("movdqu xmm0, xmmword (%rsp)");
}
and it gives me error: invalid operand for instruction. I've tried changing xmmword to dqword, to no avail, and I'm not sure what I'm doing wrong.
Thanks in advance.
There are (at least) two dialects of assembler for the x86: intel format and at&t format. It looks like you are trying to write code using intel format, but compiling using at&t.
Your code will compile with this (note that at&t syntax needs a $ before immediate operands):
int main(void) {
    asm volatile("subq $16, %rsp");
    asm volatile("movdqu %xmm0, (%rsp)");
}
If you use the -masm=intel compile switch, you can also use this (which may look more familiar to you):
int main(void) {
    asm volatile("sub rsp, 16");
    asm volatile("movdqu xmmword ptr [rsp], xmm0");
}
That said, writing an asm block using multiple asm statements like this is a bad idea. gcc docs explicitly state:
Do not expect a sequence of asm statements to remain perfectly consecutive after compilation. If certain instructions need to remain consecutive in the output, put them in a single multi-instruction asm statement.
So perhaps something more like:
int main(void) {
    asm volatile("subq $16, %rsp\n"
                 "movdqu %xmm0, (%rsp)");
}
Also, if you are going to be reading or updating variables, you should not be using basic asm, but instead use Extended. The docs there are very detailed and there are a number of samples.
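For example, a minimal extended-asm sketch (names are illustrative) that stores an __m128i variable through a memory operand, letting the compiler pick the addressing instead of adjusting %rsp by hand:
#include <emmintrin.h>
#include <stdint.h>
void store_vec(__m128i v, uint8_t buf[16]) {
    asm("movdqu %1, %0"
        : "=m" (*(__m128i *)buf)  /* memory output; compiler supplies the address */
        : "x" (v));               /* vector register input */
}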

how to work with 128 bits C variable and xmm 128 bits asm?

In gcc, I want to do a 128-bit XOR of two C variables via asm code: how?
asm (
    "movdqa %1, %%xmm1;"
    "movdqa %0, %%xmm0;"
    "pxor %%xmm1, %%xmm0;"
    "movdqa %%xmm0, %0;"
    :"=x"(buff)          /* output operand */
    :"x"(bu), "x"(buff)
    :"%xmm0","%xmm1"
);
but I get a segmentation fault;
this is the objdump output:
movq -0x80(%rbp),%xmm2
movq -0x88(%rbp),%xmm3
movdqa %xmm2,%xmm1
movdqa %xmm2,%xmm0
pxor %xmm1,%xmm0
movdqa %xmm0,%xmm2
movq %xmm2,-0x78(%rbp)
You would see segfault issues if the variables aren't 16-byte aligned. The CPU can't MOVDQA to/from unaligned memory addresses, and would generate a processor-level "GP exception", prompting the OS to segfault your app.
C variables you declare (stack, global) or allocate on the heap aren't generally aligned to a 16 byte boundary, though occasionally you may get an aligned one by chance. You could direct the compiler to ensure proper alignment by using the __m128 or __m128i data types. Each of those declares a properly-aligned 128 bit value.
Further, reading the objdump output, it looks like the compiler wrapped the asm sequence with code that copies the operands from the stack to the xmm2 and xmm3 registers using the MOVQ instruction, only to have your asm code then copy the values to xmm0 and xmm1. After xor-ing into xmm0, the wrapper copies the result to xmm2, only to then copy it back to the stack. Overall, not terribly efficient. MOVQ copies 8 bytes at a time and expects (under some circumstances) an 8-byte-aligned address; given an unaligned address, it could fail just like MOVDQA. The wrapper code, however, adds an aligned offset (-0x80, -0x88, and later -0x78) to the BP register, which may or may not contain an aligned value. Overall, there's no guarantee of alignment in the generated code.
The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", v64[1], v64[0]);
}

int main(void) {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff),
            x;
    asm (
        "movdqa %1, %%xmm0;"   /* xmm0 <- a */
        "movdqa %2, %%xmm1;"   /* xmm1 <- b */
        "pxor %%xmm1, %%xmm0;" /* xmm0 <- xmm0 xor xmm1 */
        "movdqa %%xmm0, %0;"   /* x <- xmm0 */
        :"=x"(x)               /* output operand, %0 */
        :"x"(a), "x"(b)        /* input operands, %1, %2 */
        :"%xmm0","%xmm1"       /* clobbered registers */
    );
    /* print the arguments and result as 2 64-bit hex values */
    print128(a);
    print128(b);
    print128(x);
    return 0;
}
compile with (gcc, ubuntu 32 bit)
gcc -msse2 -o app app.c
output:
10ffff0000ffff00 00ffff0000ffff00
0000ffff0000ffff 0000ffff0000ffff
10ff00ff00ff00ff 00ff00ff00ff00ff
In the code above, _mm_setr_epi32 is used to initialize a and b with 128-bit values, because the compiler may not support 128-bit integer literals.
print128 writes out the hexadecimal representation of a 128-bit integer, as printf may not be able to do so.
The following is shorter and avoids some of the duplicate copying. The compiler adds its hidden wrapping movdqa's to make pxor %2,%0 magically work without you having to load the registers on your own:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *px = (int64_t*) &value;
    printf("%.16llx %.16llx\n", px[1], px[0]);
}

int main(void) {
    __m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00),
            b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff);
    asm (
        "pxor %2, %0;"   /* a <- b xor a */
        :"=x"(a)         /* output operand, %0 */
        :"0"(a), "x"(b)  /* input operands, %1, %2; "0" ties %1 to %0 */
    );
    print128(a);
    return 0;
}
compile as before:
gcc -msse2 -o app app.c
output:
10ff00ff00ff00ff 00ff00ff00ff00ff
Alternatively, if you'd like to avoid the inline assembly, you could use the SSE intrinsics instead (PDF). Those are inlined functions/macros that encapsulate MMX/SSE instructions with a C-like syntax. _mm_xor_si128 reduces your task to a single call:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>

void print128(__m128i value) {
    int64_t *v64 = (int64_t*) &value;
    printf("%.16llx %.16llx\n", v64[1], v64[0]);
}

int main(void)
{
    __m128i x = _mm_xor_si128(
        _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
        _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff));
    print128(x);
    return 0;
}
compile:
gcc -msse2 -o app app.c
output:
10ff00ff00ff00ff 00ff00ff00ff00ff
Umm, why not use the __builtin_ia32_pxor intrinsic?
Under late model gcc (mine is 4.5.5) the option -O2 or above implies -fstrict-aliasing which causes the code given above to complain:
supersuds.cpp:31: warning: dereferencing pointer ‘v64’ does break strict-aliasing rules
supersuds.cpp:30: note: initialized from here
This can be remedied by supplying additional type attributes as follows:
typedef int64_t __attribute__((__may_alias__)) alias_int64_t;

void print128(__m128i value) {
    alias_int64_t *v64 = (alias_int64_t *) &value;
    printf("%.16lx %.16lx\n", v64[1], v64[0]);
}
I first tried the attribute directly without the typedef. It was accepted, but I still got the warning. The typedef seems to be a necessary piece of the magic.
BTW, this is my second answer here and I still hate the fact that I can't yet tell where I'm permitted to edit, so I wasn't able to post this where it belonged.
And one more thing, under AMD64, the %llx format specifier needs to be changed to %lx.
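A portable alternative is the <inttypes.h> format macros, which expand to the right length modifier on both 32-bit and 64-bit targets:
#include <inttypes.h>
printf("%.16" PRIx64 " %.16" PRIx64 "\n", v64[1], v64[0]);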
