Define a `static const` SIMD Variable within a `C` Function

I have a function in this form (From Fastest Implementation of Exponential Function Using SSE):
__m128 FastExpSse(__m128 x)
{
static __m128 const a = _mm_set1_ps(12102203.2f); // (1 << 23) / ln(2)
static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);
static __m128 const m87 = _mm_set1_ps(-87);
// fast exponential function, x should be in [-87, 87]
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
I want to make it C compatible.
However, when I compile it as C, the compiler rejects initializers of the form static __m128i const b = _mm_set1_epi32(127 * (1 << 23) - 486411);.
I also don't want the first three values to be recomputed on every function call.
One solution is to inline the function (but compilers sometimes decline to do that).
Is there a C-style way to achieve this in case the function isn't inlined?
Thank you.

Remove static and const.
Also remove them from the C++ version. const is OK, but static is horrible, introducing guard variables that are checked every time, and a very expensive initialization the first time.
__m128 a = _mm_set1_ps(12102203.2f); is not a function call, it's just a way to express a vector constant. No time can be saved by "doing it only once" - it normally happens zero times, with the constant vector being prepared in the data segment of the program and simply being loaded at runtime, without the junk around it that static introduces.
Check the asm to be sure, without static this is what happens: (from godbolt)
FastExpSse(float __vector(4)):
movaps xmm1, XMMWORD PTR .LC0[rip]
cmpleps xmm1, xmm0
mulps xmm0, XMMWORD PTR .LC1[rip]
cvtps2dq xmm0, xmm0
paddd xmm0, XMMWORD PTR .LC2[rip]
andps xmm0, xmm1
ret
.LC0:
.long 3266183168
.long 3266183168
.long 3266183168
.long 3266183168
.LC1:
.long 1262004795
.long 1262004795
.long 1262004795
.long 1262004795
.LC2:
.long 1064866805
.long 1064866805
.long 1064866805
.long 1064866805
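For reference, here is a minimal sketch of the C version with static removed, as suggested above. The constants and the body are the ones from the question; fast_exp_scalar is a hypothetical helper I added only to make it easy to check one value at a time.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Plain automatic variables: valid C, no guard variable, and the
 * compiler is free to load each constant straight from read-only data. */
__m128 FastExpSse(__m128 x)
{
    const __m128  a   = _mm_set1_ps(12102203.2f);                 /* (1 << 23) / ln(2) */
    const __m128i b   = _mm_set1_epi32(127 * (1 << 23) - 486411);
    const __m128  m87 = _mm_set1_ps(-87.0f);

    /* fast exponential approximation, x should be in [-87, 87] */
    __m128  mask = _mm_cmpge_ps(x, m87);   /* lanes below -87 are forced to 0 */
    __m128i tmp  = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}

/* Hypothetical helper: broadcast one value, evaluate, return lane 0. */
float fast_exp_scalar(float v)
{
    return _mm_cvtss_f32(FastExpSse(_mm_set1_ps(v)));
}
```

The result is only an approximation (a few percent off), so compare against expf with a tolerance rather than for equality.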

_mm_set1_ps(-87); or any other _mm_set intrinsic is not a valid static initializer with current compilers, because it's not treated as a constant expression.
In C++, it compiles to runtime initialization of the static storage location (copying from a vector literal somewhere else). And if it's a static __m128 inside a function, there's a guard variable to protect it.
In C, it simply refuses to compile, because C doesn't support non-constant initializers / constructors. _mm_set is not equivalent to a braced initializer for the underlying GNU C native vector, which is what @benjarobin's answer uses.
This is really dumb, and seems to be a missed-optimization in all 4 mainstream x86 C++ compilers (gcc/clang/ICC/MSVC). Even if it somehow matters that each static const __m128 var have a distinct address, the compiler could achieve that by using initialized read-only storage instead of copying at runtime.
So it seems like constant propagation fails to go all the way to turning _mm_set into a constant initializer even when optimization is enabled.
Never use static const __m128 var = _mm_set... even in C++; it's inefficient.
Inside a function is even worse, but global scope is still bad.
Instead, avoid static. You can still use const to stop yourself from accidentally assigning something else, and to tell human readers that it's a constant. Without static, it has no effect on where/how your variable is stored. const on automatic storage just does compile-time checking that you don't modify the object.
const __m128 var = _mm_set1_ps(-87); // not static
Compilers are good at this, and will optimize the case where multiple functions use the same vector constant, the same way they de-duplicate string literals and put them in read-only memory.
Defining constants this way inside small helper functions is fine: compilers will hoist the constant-setup out of a loop after inlining the function.
It also lets compilers optimize away the full 16 bytes of storage, and load it with vbroadcastss xmm0, dword [mem], or stuff like that.
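A small sketch of the "constant inside a helper" pattern described above. The names clamp_low and clamp_low_array are made up for illustration, and the sketch assumes n is a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Small helper with its constant defined locally, not static. */
static inline __m128 clamp_low(__m128 v)
{
    const __m128 m87 = _mm_set1_ps(-87.0f);
    return _mm_max_ps(v, m87);
}

/* Hypothetical caller: once clamp_low is inlined, setting up m87 is
 * loop-invariant, so an optimizing compiler can do it once before the loop.
 * n must be a multiple of 4; data may be unaligned. */
void clamp_low_array(float *data, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);
        _mm_storeu_ps(data + i, clamp_low(v));
    }
}
```

Whether the hoisting actually happens depends on the compiler and optimization level, so check the generated asm if it matters.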

This solution is clearly not portable; it works with GCC 8 (the only compiler I tested):
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
static void print128(const void *p)
{
unsigned char buf[16];
memcpy(buf, p, 16);
for (int i = 0; i < 16; ++i)
{
printf("%02X ", buf[i]);
}
printf("\n");
}
int main(void)
{
static __m128 const glob_a = INIT_M128(12102203.2f);
static __m128i const glob_b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const glob_m87 = INIT_M128(-87.0f);
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
print128(&a);
print128(&glob_a);
print128(&b);
print128(&glob_b);
print128(&m87);
print128(&glob_m87);
return 0;
}
As explained in @harold's answer (in C only), the following code, built with or without WITHSTATIC, produces exactly the same machine code.
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
#include <string.h>
#define INIT_M128(vFloat) {(vFloat), (vFloat), (vFloat), (vFloat)}
#define INIT_M128I(vU32) {((uint64_t)(vU32) | (uint64_t)(vU32) << 32u), ((uint64_t)(vU32) | (uint64_t)(vU32) << 32u)}
__m128 FastExpSse2(__m128 x)
{
#ifdef WITHSTATIC
static __m128 const a = INIT_M128(12102203.2f);
static __m128i const b = INIT_M128I(127 * (1 << 23) - 486411);
static __m128 const m87 = INIT_M128(-87.0f);
#else
__m128 a = _mm_set1_ps(12102203.2f);
__m128i b = _mm_set1_epi32(127 * (1 << 23) - 486411);
__m128 m87 = _mm_set1_ps(-87);
#endif
__m128 mask = _mm_cmpge_ps(x, m87);
__m128i tmp = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
return _mm_and_ps(_mm_castsi128_ps(tmp), mask);
}
So, in summary, it's better to remove the static and const keywords: the C++ code becomes better and simpler, and the C code stays portable, which it is not with my proposed hack.

Related

How to get bits of specific xmm registers?

So I want to get the value or state of specific xmm registers. This is primarily for a crash log or just to see the state of the registers for debugging. I tried this, but it doesn't seem to work:
#include <x86intrin.h>
#include <stdio.h>
int main(void) {
register __m128i my_val __asm__("xmm0");
__asm__ ("" :"=r"(my_val));
printf("%llu %llu\n", my_val & 0xFFFFFFFFFFFFFFFF, my_val << 63);
return 0;
}
As far as I know, the store related intrinsics would not treat the __m128i as a POD data type but rather as a reference to one of the xmm registers.
How do I get and access the bits stored in the __m128i as 64 bit integers? Or does my __asm__ above work?
How do I get and access the bits stored in the __m128i as 64 bit integers?
You will have to convert the __m128i vector to a pair of uint64_t variables. You can do that with conversion intrinsics:
uint64_t lo = _mm_cvtsi128_si64(my_val);
uint64_t hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(my_val, my_val));
...or through memory:
uint64_t buf[2];
_mm_storeu_si128((__m128i*)buf, my_val);
uint64_t lo = buf[0];
uint64_t hi = buf[1];
The latter may be worse in terms of performance, but if you intend to use it only for debugging, it would do. It is also trivial to adapt to differently sized elements, if you need that.
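To illustrate the "differently sized elements" remark, here is a sketch of the same store-to-memory approach with 32-bit and 8-bit elements. The function names are mine, not from any library.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Same store-to-memory idea, viewed as four 32-bit elements
 * (out[0] is the least significant dword). */
void unpack128_u32(__m128i v, uint32_t out[4])
{
    _mm_storeu_si128((__m128i *)out, v);
}

/* ...and as sixteen bytes, lowest address (least significant) first. */
void unpack128_u8(__m128i v, uint8_t out[16])
{
    _mm_storeu_si128((__m128i *)out, v);
}
```

Only the type of the destination buffer changes; the store itself is identical.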
Or does my __asm__ above work?
No, it doesn't. The "=r" output constraint does not allow vector registers, such as xmm0, which you pass as an output, it only allows general purpose registers. No general purpose registers are 128-bit wide, so that asm statement makes no sense.
Also, I should note that my_val << 63 shifts the value in the wrong way. If you wanted to output the high half of the hypothetical 128-bit value then you should've shifted right, not left. And besides that, shifts on vectors are either not implemented or act on each element of the vector rather than the vector as a whole, depending on the compiler. But this part is moot, as with the code above you don't need any shifts to output the two halves.
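If you did want to get at the high half with a shift, the intrinsics make the "whole vector vs. each element" distinction explicit. A sketch (x86-64 assumed, helper names are mine):

```c
#include <emmintrin.h>
#include <stdint.h>

/* Whole-register byte shift: moves the high 64 bits into the low half. */
uint64_t high_half(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(v, 8)); /* 8 bytes = 64 bits */
}

/* Per-element shift: each 64-bit lane is shifted independently,
 * so no bits cross from the high lane into the low lane. */
uint64_t low_lane_after_lane_shift(__m128i v)
{
    return (uint64_t)_mm_cvtsi128_si64(_mm_srli_epi64(v, 32));
}
```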
If you really want to know about register values, rather than __m128i C variable values, I'd suggest using a debugger like GDB. print /x $xmm0.v2_int64 when stopped at a breakpoint.
Capturing a register at the top of a function is a pretty flaky and unreliable thing to attempt (smells like you've already gone down the wrong design path; see Footnote 1). But you're on the right track with a register-asm local var. However, xmm0 can't match an "=r" constraint, only "=x". See Reading a register value into a C variable for more about using an empty asm template to tell the compiler you want a C variable to be what was in a register.
You do need the asm volatile("" : "=x"(var)); statement, though; GNU C register-asm local vars have no guarantees whatsoever except when used as operands to asm statements. (GCC will often keep your var in that register anyway, but IIRC clang won't.)
There's not a lot of guarantee about where this will be ordered wrt. other code (asm volatile may help some, or for stronger ordering also use a "memory" clobber). Also no guarantee that GCC won't use the register for something else first. (Especially a call-clobbered register like any xmm reg.) But it does at least happen to work in the version I tested.
print a __m128i variable shows how to print a __m128i as two 64-bit halves once you have it, or as other element sizes. The compiler will often optimize _mm_store_si128 / reload into shuffles, and this is for printing anyway so keep it simple.
Using an unsigned __int128 tmp; would also be an option in GNU C on x86-64.
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#ifndef __cplusplus
#include <stdalign.h>
#endif
// If you need this, you're probably doing something wrong.
// There's no guarantee about what a compiler will have in XMM0 at any point
void foo() {
register __m128i xmm0 __asm__("xmm0");
__asm__ volatile ("" :"=x"(xmm0));
alignas(16) uint64_t buf[2];
_mm_store_si128((__m128i*)buf, xmm0);
printf("%llu %llu\n", (unsigned long long)buf[1], (unsigned long long)buf[0]); // I'd normally use hex, like %#llx
}
This prints the high half first (most significant), so reading left to right across both elements we get each byte in descending order of memory address within buf.
It compiles to the asm we want with both GCC and clang (Godbolt), not stepping on xmm0 before reading it.
# GCC10.2 -O3
foo:
movhlps xmm1, xmm0
movq rdx, xmm0 # low half -> RDX
mov edi, OFFSET FLAT:.LC0
xor eax, eax
movq rsi, xmm1 # high half -> RSI
jmp printf
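For the unsigned __int128 option mentioned earlier, a sketch could look like this. It assumes GNU C on x86-64 (the type is a compiler extension), and the helper names are mine.

```c
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

/* GNU C on x86-64 only: unsigned __int128 is a compiler extension. */
unsigned __int128 to_u128(__m128i v)
{
    unsigned __int128 tmp;
    memcpy(&tmp, &v, sizeof tmp);   /* byte copy, no aliasing concerns */
    return tmp;
}

/* Ordinary integer operations then give the two halves. */
uint64_t u128_low(unsigned __int128 x)  { return (uint64_t)x; }
uint64_t u128_high(unsigned __int128 x) { return (uint64_t)(x >> 64); }
```

Unlike a shift applied to the vector type, x >> 64 here really is a 128-bit shift.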
Footnote 1:
If you make sure your function doesn't inline, you could take advantage of the calling convention to get the incoming values of xmm0..7 (for x86-64 System V), or xmm0..3 if you have no integer args (Windows x64).
__attribute__((noinline))
void foo(__m128i xmm0, __m128i xmm1, __m128i xmm2, etc.) {
// do whatever you want with the xmm0..7 args
}
If you want to provide a different prototype for the function for callers to use (which omits the __m128i args), that can maybe work. It's of course Undefined Behaviour in ISO C, but if you truly stop inlining, the effects depend on the calling convention. As long as you make sure it's noinline so link-time optimization doesn't do cross-file inlining.
Of course, the mere fact of inserting a function call will change register allocation in the caller, so this only helps for a function you were going to call anyway.
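A sketch of the footnote's idea with an honest prototype, so no Undefined Behaviour is involved: the caller passes the vectors explicitly and the noinline function stores them. capture_xmm_args and the out layout are hypothetical, and noinline is the GNU C attribute.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Hypothetical logger: noinline keeps a real call, so on x86-64 System V
 * the two vector arguments arrive in xmm0 and xmm1. */
__attribute__((noinline))
void capture_xmm_args(__m128i x0, __m128i x1, uint64_t out[4])
{
    _mm_storeu_si128((__m128i *)&out[0], x0);  /* out[0] = low, out[1] = high */
    _mm_storeu_si128((__m128i *)&out[2], x1);  /* out[2] = low, out[3] = high */
}
```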

Early-clobbers and named registers

I'm trying to understand the usage of "early-clobber outputs" but I stumbled upon a snippet which confuses me. Consider the following multiply-modulo function:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t d;
uint64_t unused;
asm ("mulq %3\n\t"
"divq %4"
:"=a"(unused), "=&d"(d)
:"a"(a), "rm"(b), "rm"(n)
:"cc");
return d;
}
Why does RDX have the early-clobber flag (&)? Is it because mulq implicitly modifies RDX? Would the example work without the flag? (I tried and it seems it does. But would it be correct as well?) On the other hand, isn't it enough that the function outputs RDX to tell the compiler RDX was modified?
Also, why is there that unused variable? I assume it's there to denote that RAX was modified, correct? Can I remove it? (I tried and it seems to work.) I would have expected the correct way of marking the modified RAX to be including "rax" in the clobbers, along with "cc". But that does not work.
While this doesn't answer the question - I think the comments have it covered - I would simplify this, by letting the compiler choose registers vs memory, and allowing it to schedule mulq and divq as required... The problem is that div has register restrictions:
static inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t n)
{
uint64_t ret, q, rh, rl;
__asm__ ("mulq %3" : "=a,a" (rl), "=d,d" (rh)
: "%0,0" (a), "r,m" (b) : "cc");
/* assert(rh < n), otherwise `div` raises a 'divide error' - the quotient is
* too large to store in `%rax`. */
/* the "%0,0" notation implies that `(a)` and `(b)` are commutative.
* the "cc" clobber is implicit in gcc / clang asm (and, I expect, Intel icc)
* for the x86-64 asm statements. */
__asm__ ("divq %4" : "=a,a" (q), "=d,d" (ret)
: "0,0" (rl), "1,1" (rh), "r,m" (n) : "cc");
return ret;
}
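To sanity-check an asm version like this, it helps to have a plain reference next to it. Below is a sketch, assuming GNU C on x86-64: mulmod64_ref uses unsigned __int128, and mulmod64_asm is a simplified two-statement variant with single-alternative constraints (both names are mine). Because each asm statement is a single instruction that reads all of its inputs before writing its outputs, no early-clobber is needed in this split form.

```c
#include <stdint.h>

/* Reference version: GNU C unsigned __int128, no inline asm. */
uint64_t mulmod64_ref(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)(((unsigned __int128)a * b) % n);
}

/* Two-statement asm version (x86-64, GCC/clang).
 * Requires the high half of a*b to be < n (true whenever a < n or b < n),
 * otherwise divq raises a divide error. */
uint64_t mulmod64_asm(uint64_t a, uint64_t b, uint64_t n)
{
    uint64_t q, rem, hi, lo;
    __asm__ ("mulq %3"                      /* rdx:rax = rax * b */
             : "=a" (lo), "=d" (hi)
             : "0" (a), "rm" (b)
             : "cc");
    __asm__ ("divq %4"                      /* rax = quotient, rdx = remainder */
             : "=a" (q), "=d" (rem)
             : "0" (lo), "1" (hi), "rm" (n)
             : "cc");
    (void)q;                                /* quotient is not needed */
    return rem;
}
```

The reference version has no precondition beyond n != 0, so it is also the safer default unless you have measured that the asm matters.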

MPLAB XC16: Mixing C and Assembly

I am attempting to mix some C and assembly language and I am having a bear of a time. I am experienced with C, somewhat with assembly, but I haven't used them on the same project before.
At the moment, I am attempting to compile the simplest possible project, which is a Q1.15 fixed-point multiplication. I don't actually care about the code output, I just needed something to compile so that I could build off of it.
myq15.h:
#ifndef _Q15_MATH
#define _Q15_MATH
#include <stdint.h>
typedef int16_t q15_t;
extern q15_t q15_mul(q15_t multiplicand, q15_t multiplier);
q15_t q15_add(q15_t addend, q15_t adder);
#endif
myq15.c
#include "myq15.h"
q15_t q15_add(q15_t addend, q15_t adder){
int32_t result = (int32_t)addend + (int32_t)adder;
if(result > 32767) result = 32767;
else if(result < -32768) result = -32768;
return (q15_t)result;
}
myq15.s:
.include "xc.inc"
.text
.global _q15_mul
_q15_mul:
; w3:w2 = w1 * w0
mul.ss w0, w1, w2
; w0 = (w3:w2) >> 15
rlc w2, w2
rlc w3, w3
; w0 = w3
mov w3, w0
return
.end
My 'main' file simply calls a q15_add() and q15_mul() instance.
On compile, The linker states:
build/default/production/_ext/608098890/myq15.o(.text+0x0): In function `_q15_mul':
: multiple definition of `_q15_mul'
Again, I am trying to figure out how to mix the assembly and C file for other purposes, but if I can't get this simple program to work, I'm hopeless!
Thanks,
It's a bad idea to give both files the same base name: the toolchain might confuse them, because both myq15.c and myq15.s compile to myq15.o by default.
You can use inline assembly in a c file. If you don't have a lot of assembly, this is probably the easiest way.
It's recommended to encapsulate any assembly into its own function call (which can be inlined) to prevent any side effects with optimization.
static inline void SomeAssembly(void) {
asm volatile("pwrsav #");
asm volatile("reset");
}
int main (void) {
SomeAssembly();
while(1);
return -1;
}

Vector Sum using AVX Inline Assembly on XeonPhi

I am new to the Intel Xeon Phi co-processor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as the compiler and want to write inline assembly. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>
void* aligned_malloc(size_t size, size_t alignment) {
uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
uintptr_t t = r + sizeof(uintptr_t);
uintptr_t o =(t + alignment) & ~(uintptr_t)alignment;
if (!r) return NULL;
((uintptr_t*)o)[-1] = r;
return (void*)o;
}
int main(int argc, char* argv[])
{
printf("Starting calculation...\n");
int i;
const int length = 65536;
unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
for(i=0; i<length; i++){
A[i] = 1;
B[i] = 2;
}
const int AVXLength = length / 16;
unsigned char * pA = (unsigned char *) A;
unsigned char * pB = (unsigned char *) B;
unsigned char * pC = (unsigned char *) C;
for(i=0; i<AVXLength; i++ ){
__asm__("vmovdqa32 %1,%%zmm0\n"
"vmovdqa32 %2,%%zmm1\n"
"vpaddd %0,%%zmm0,%%zmm1;"
: "=m" (pC) : "m" (pA), "m" (pB));
pA += 64;
pB += 64;
pC += 64;
}
// To prove that the program actually worked
for (i=0; i <5 ; i++)
{
printf("C[%d] = %u\n", i, C[i]);
}
}
However, when I run the program, I get a segmentation fault from the asm part. Can somebody help me with that?
Thanks
Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI), with a vector size of 512 bits. So putting any sort of AVX-specific assembly into KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.
Although Knights Corner (KNC) does not have AVX512, it has something very similar. Many of the mnemonics are the same. In fact, in the OP's case the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the many-core software stack for KNC and presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OP's code is failing, I can only guess, since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OP's mnemonics are the same as AVX512, we can use AVX512 intrinsics to figure out the correct assembly
#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
__m512i a16 = _mm512_load_epi32(A);
__m512i b16 = _mm512_load_epi32(B);
__m512i s16 = _mm512_add_epi32(a16,b16);
_mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c produces
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentation says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and according to Intel should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register constraint r instead of the memory constraint m because, in my past experience, the m constraint did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here, the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
exist is that their pointer parameters have been changed to void*. For example, with SSE you have to cast
int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);
whereas with AVX512 you no longer need to
int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); //this is also fine
It's still not clear to me why there are both vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.

how to work with 128 bits C variable and xmm 128 bits asm?

In gcc, I want to do a 128-bit xor of two C variables via asm code. How?
asm (
"movdqa %1, %%xmm1;"
"movdqa %0, %%xmm0;"
"pxor %%xmm1,%%xmm0;"
"movdqa %%xmm0, %0;"
:"=x"(buff) /* output operand */
:"x"(bu), "x"(buff)
:"%xmm0","%xmm1"
);
but I get a segmentation fault;
this is the objdump output:
movq -0x80(%rbp),%xmm2
movq -0x88(%rbp),%xmm3
movdqa %xmm2,%xmm1
movdqa %xmm2,%xmm0
pxor %xmm1,%xmm0
movdqa %xmm0,%xmm2
movq %xmm2,-0x78(%rbp)
You would see segfault issues if the variables aren't 16-byte aligned. The CPU can't MOVDQA to/from unaligned memory addresses, and would generate a processor-level "GP exception", prompting the OS to segfault your app.
C variables you declare (stack, global) or allocate on the heap aren't generally aligned to a 16 byte boundary, though occasionally you may get an aligned one by chance. You could direct the compiler to ensure proper alignment by using the __m128 or __m128i data types. Each of those declares a properly-aligned 128 bit value.
Further, reading the objdump, it looks like the compiler wrapped the asm sequence with code to copy the operands from the stack to the xmm2 and xmm3 registers using the MOVQ instruction, only to have your asm code then copy the values to xmm0 and xmm1. After xor-ing into xmm0, the wrapper copies the result to xmm2 only to then copy it back to the stack. Overall, not terribly efficient. MOVQ copies 8 bytes at a time, and expects (under some circumstances) an 8-byte aligned address. Getting an unaligned address, it could fail just like MOVDQA. The wrapper code, however, adds an aligned offset (-0x80, -0x88, and later -0x78) to the BP register, which may or may not contain an aligned value. Overall, there's no guarantee of alignment in the generated code.
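As a side illustration of the alignment point, here is a sketch with intrinsics that contrasts the aligned accesses (which compile to movdqa) with the unaligned ones (movdqu). The function names are made up for this example.

```c
#include <emmintrin.h>
#include <stdint.h>

/* XOR two 16-byte blocks that may sit at any address:
 * loadu/storeu (movdqu) have no alignment requirement. */
void xor128_unaligned(const void *a, const void *b, void *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_xor_si128(va, vb));
}

/* Same operation with aligned accesses (movdqa):
 * all three pointers must be 16-byte aligned or the CPU faults. */
void xor128_aligned(const void *a, const void *b, void *out)
{
    __m128i va = _mm_load_si128((const __m128i *)a);
    __m128i vb = _mm_load_si128((const __m128i *)b);
    _mm_store_si128((__m128i *)out, _mm_xor_si128(va, vb));
}
```

When you can't control where the data lives, the unaligned variant is the safe choice.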
The following ensures the arguments and result are stored in correctly aligned memory locations, and seems to work fine:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
void print128(__m128i value) {
int64_t *v64 = (int64_t*) &value;
printf("%.16llx %.16llx\n", v64[1], v64[0]);
}
int main(void) {
__m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first! */
b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff),
x;
asm (
"movdqa %1, %%xmm0;" /* xmm0 <- a */
"movdqa %2, %%xmm1;" /* xmm1 <- b */
"pxor %%xmm1, %%xmm0;" /* xmm0 <- xmm0 xor xmm1 */
"movdqa %%xmm0, %0;" /* x <- xmm0 */
:"=x"(x) /* output operand, %0 */
:"x"(a), "x"(b) /* input operands, %1, %2 */
:"%xmm0","%xmm1" /* clobbered registers */
);
/* printf the arguments and result as 2 64-bit hex values */
print128(a);
print128(b);
print128(x);
}
compile with (gcc, ubuntu 32 bit)
gcc -msse2 -o app app.c
output:
10ffff0000ffff00 00ffff0000ffff00
0000ffff0000ffff 0000ffff0000ffff
10ff00ff00ff00ff 00ff00ff00ff00ff
In the code above, _mm_setr_epi32 is used to initialize a and b with 128-bit values, as the compiler may not support 128-bit integer literals.
print128 writes out the hexadecimal representation of a 128 bit integer, as printf may not be able to do so.
The following is shorter and avoids some of the duplicate copying. The compiler adds its hidden wrapping movdqa's to make pxor %2,%0 magically work without you having to load the registers on your own:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
void print128(__m128i value) {
int64_t *px = (int64_t*) &value;
printf("%.16llx %.16llx\n", px[1], px[0]);
}
int main(void) {
__m128i a = _mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00),
b = _mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff);
asm (
"pxor %2, %0;" /* a <- b xor a */
:"=x"(a) /* output operand, %0 */
:"0"(a), "x"(b) /* input operands: %1 (tied to %0), %2 */
);
print128(a);
}
compile as before:
gcc -msse2 -o app app.c
output:
10ff00ff00ff00ff 00ff00ff00ff00ff
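A variant of the short version, sketched as a function: a "+x" read-write operand tells the compiler that the same SSE register is both input and output. xor128_asm is a name I made up, and the template assumes GCC's default AT&T syntax.

```c
#include <emmintrin.h>

/* "+x" marks acc as read and written in the same SSE register,
 * so the asm is a single pxor with no separate input copy of acc. */
__m128i xor128_asm(__m128i acc, __m128i b)
{
    __asm__ ("pxor %1, %0"      /* AT&T order: acc ^= b */
             : "+x" (acc)
             : "x" (b));
    return acc;
}
```

It should give the same result as _mm_xor_si128(acc, b), which is an easy way to check it.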
Alternatively, if you'd like to avoid the inline assembly, you could use the SSE intrinsics instead. Those are inlined functions/macros that encapsulate MMX/SSE instructions with a C-like syntax. _mm_xor_si128 reduces your task to a single call:
#include <stdio.h>
#include <stdint.h>
#include <emmintrin.h>
void print128(__m128i value) {
int64_t *v64 = (int64_t*) &value;
printf("%.16llx %.16llx\n", v64[1], v64[0]);
}
int main(void)
{
__m128i x = _mm_xor_si128(
_mm_setr_epi32(0x00ffff00, 0x00ffff00, 0x00ffff00, 0x10ffff00), /* low dword first !*/
_mm_setr_epi32(0x0000ffff, 0x0000ffff, 0x0000ffff, 0x0000ffff));
print128(x);
}
compile:
gcc -msse2 -o app app.c
output:
10ff00ff00ff00ff 00ff00ff00ff00ff
Umm, why not use the __builtin_ia32_pxor intrinsic?
Under recent gcc versions (mine is 4.5.5), the option -O2 or above implies -fstrict-aliasing, which causes the code given above to complain:
supersuds.cpp:31: warning: dereferencing pointer ‘v64’ does break strict-aliasing rules
supersuds.cpp:30: note: initialized from here
This can be remedied by supplying additional type attributes as follows:
typedef int64_t __attribute__((__may_alias__)) alias_int64_t;
void print128(__m128i value) {
alias_int64_t *v64 = (alias_int64_t*) &value;
printf("%.16lx %.16lx\n", v64[1], v64[0]);
}
I first tried the attribute directly without the typedef. It was accepted, but I still got the warning. The typedef seems to be a necessary piece of the magic.
And one more thing, under AMD64, the %llx format specifier needs to be changed to %lx.
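Another way around both the aliasing warning and the %llx / %lx difference, sketched here as an alternative rather than as what the answers above use: copy the bytes out with memcpy and print with the PRIx64 macro from <inttypes.h>. format128 is a hypothetical helper that writes into a caller-supplied buffer.

```c
#include <emmintrin.h>
#include <inttypes.h>   /* PRIx64 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Formats value as "high low" in hex; buf must hold at least 34 bytes. */
void format128(__m128i value, char *buf, size_t size)
{
    uint64_t v64[2];
    memcpy(v64, &value, sizeof v64);   /* no pointer cast, no aliasing warning */
    snprintf(buf, size, "%016" PRIx64 " %016" PRIx64, v64[1], v64[0]);
}

void print128(__m128i value)
{
    char buf[34];
    format128(value, buf, sizeof buf);
    puts(buf);
}
```

PRIx64 expands to the right length modifier for uint64_t on the target, so the same source works on 32-bit and 64-bit builds.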

Resources