Vector Sum using AVX Inline Assembly on XeonPhi - c

I am new to using the Intel Xeon Phi co-processor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as the compiler and want to write inline assembly. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>
void* aligned_malloc(size_t size, size_t alignment) {
    uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
    uintptr_t t = r + sizeof(uintptr_t);
    uintptr_t o = (t + alignment) & ~(uintptr_t)alignment;
    if (!r) return NULL;
    ((uintptr_t*)o)[-1] = r;
    return (void*)o;
}
int main(int argc, char* argv[])
{
    printf("Starting calculation...\n");
    int i;
    const int length = 65536;
    unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    const int AVXLength = length / 16;
    unsigned char * pA = (unsigned char *) A;
    unsigned char * pB = (unsigned char *) B;
    unsigned char * pC = (unsigned char *) C;
    for(i=0; i<AVXLength; i++ ){
        __asm__("vmovdqa32 %1,%%zmm0\n"
                "vmovdqa32 %2,%%zmm1\n"
                "vpaddd %0,%%zmm0,%%zmm1;"
                : "=m" (pC) : "m" (pA), "m" (pB));
        pA += 64;
        pB += 64;
        pC += 64;
    }
    // To prove that the program actually worked
    for (i=0; i<5; i++)
    {
        printf("C[%d] = %f\n", i, C[i]);
    }
}
However, when I run the program, I get a segmentation fault from the asm part. Can somebody help me with that?
Thanks

Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI), with a vector size of 512 bits. So trying to put any sort of AVX-specific assembly into KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.

Although Knights Corner (KNC) does not have AVX512, it has something very similar. Many of the mnemonics are the same. In fact, in the OP's case the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the many-core software stack (MPSS) for KNC and presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OP's code is failing, I can only guess, since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OP's mnemonics are the same as in AVX512, we can use AVX512 intrinsics to figure out the correct assembly:
#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16,b16);
    _mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c produces:
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32 even though the Intel documentation says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and according to Intel should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code, here is the inline assembly I would use:
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register constraint r instead of the memory constraint m because, from past experience, the memory constraint did not produce the assembly I expected.
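For completeness, this is how that asm would slot into the OP's loop. This is only a sketch and untested, since I don't have a KNC card; also note that zmm0 is overwritten by the asm, so declaring it as a clobber would be safer:
for(i=0; i<AVXLength; i++ ){
    __asm__ ("vmovdqa64 (%1), %%zmm0\n"
             "vpaddd (%2), %%zmm0, %%zmm0\n"
             "vmovdqa64 %%zmm0, (%0)"
             :
             : "r" (pC), "r" (pA), "r" (pB)
             : "memory" /* zmm0 is also modified here */
             );
    pA += 64;  // advance by 64 bytes = 16 ints per iteration
    pB += 64;
    pC += 64;
}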
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here, the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
exist is that the parameters have been converted to void*. For example, with SSE you have to cast:
int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);
whereas with AVX512 you no longer need to:
int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); // this is also fine
It's still not clear to me why there are both vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they may make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics:
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.
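For illustration, here is a minimal sketch (assuming an AVX512F target and a GCC with AVX512 intrinsics) of a masked variant of the add above; with a write mask, the 32-bit granularity of vmovdqa32 is what actually matters:
#include <immintrin.h>

// Add 16 ints but only write the even lanes of C; the odd lanes of C are left
// untouched. A, B and C are assumed to be 64-byte aligned.
void masked_add(const int *A, const int *B, int *C) {
    __mmask16 k = 0x5555;                  // mask selecting lanes 0, 2, 4, ..., 14
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16, b16);
    _mm512_mask_store_epi32(C, k, s16);    // should compile to a masked vmovdqa32
}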

Related

Cross-compiling C program for ARMv8-A in Linux X86_64 system

I am new to the ARM architecture, and I am experimenting with cache cleaning on ARM.
I am following the "Programmer's Guide for ARMv8-A", since gem5 has this implementation as per (https://www.gem5.org/documentation/general_docs/architecture_support/arm_implementation/).
I am trying to cross-compile the code below on a Linux x86_64 system using
arm-linux-gnueabi-gcc test_arm.c -o test,
but I am getting the following error:
/tmp/ccTM2bcE.s: Assembler messages:
/tmp/ccTM2bcE.s:38: Error: selected processor does not support requested special purpose register -- `mrs r3,ctr_el0'
/tmp/ccTM2bcE.s:69: Error: bad instruction `dc cavu,r3'
/tmp/ccTM2bcE.s:150: Error: selected processor does not support `dsb ish' in ARM mode
/tmp/ccTM2bcE.s:159: Error: selected processor does not support `dsb ish' in ARM mode
code
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <stdint.h>
void clean_invalidate(uint64_t addr){
    uint64_t ctr_el0 = 0;
    if(ctr_el0 == 0)
        asm volatile("mrs %0, ctr_el0":"=r"(ctr_el0)::);
    const size_t dcache_line_size = 4 << ((ctr_el0 >> 16) & 15);
    addr = addr & ~(dcache_line_size - 1);
    asm volatile("dc cvau, %0"::"r"(addr):);
}
int main(){
    int a[1000];
    int index = 0;
    uint64_t addr = 0;
    double time_spend = 0.0;
    clock_t begin = clock();
    for(int i=0;i<100;i++){
        index = rand()%1000;
        a[index] = index;
        addr = (uint64_t)(&a[index]);
        asm volatile("dsb ish");
        clean_invalidate(addr);
        asm volatile("dsb ish");
        int b = a[index];
    }
    clock_t end = clock();
    time_spend = (double)(end-begin)/CLOCKS_PER_SEC;
    printf("Time:%f\n",time_spend);
    return 0;
}
Can someone please help me compile this code for ARMv8-A on a Linux x86 system?
PS: You can ignore the "cast from pointer to integer of different size" warning.
I think mrs %0, ctr_el0 is an ARMv8 AArch64 instruction, and arm-linux-gnueabi-gcc is the ARMv7/AArch32 compiler; you have to use aarch64-linux-gnu-gcc instead.
And dc cavu does not seem to exist, did you mean dc cvau?
With those two changes it compiles.
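For reference, a sketch of the routine with both fixes applied, built as an AArch64 binary with aarch64-linux-gnu-gcc test_arm.c -o test:
#include <stddef.h>
#include <stdint.h>

void clean_invalidate(uint64_t addr){
    uint64_t ctr_el0 = 0;
    asm volatile("mrs %0, ctr_el0" : "=r"(ctr_el0));      // AArch64 system-register read
    const size_t dcache_line_size = 4 << ((ctr_el0 >> 16) & 15);
    addr = addr & ~(dcache_line_size - 1);
    asm volatile("dc cvau, %0" :: "r"(addr) : "memory");  // cvau, not cavu
}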
To be honest, there is also MRS in ARMv7 in addition to MRC, but I haven't fully understood when each one should be used there. AArch64 has only MRS, so it's simpler.
For the specific case of CTR_EL0, there exists an analogous aarch32 register CTR, but that one is accessed with MRC according to the manual, not MRS.
Here are a gazillion runnable examples that might be of interest as well:
https://cirosantilli.com/linux-kernel-module-cheat/#dump-regs
https://cirosantilli.com/linux-kernel-module-cheat/#arm-userland-assembly
The problem comes with the instruction:
asm volatile("mrs %0, ctr_el0":"=r"(ctr_el0)::);
which is translated to an assembler instruction that is tied to your ARM architecture. You should take a look at the corresponding ARM registers for your target and see whether this one is included there; if not, you need to find another register with a similar purpose.

Vectorization - Speed up expected for SSE, AVX and AVX2

I am doing a benchmark about vectorization on macOS with the following i7 processor:
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
My MacBook Pro is from mid-2014.
I tried different compiler flag options for vectorization: the three that interest me are SSE, AVX and AVX2.
For my benchmark, I add each element of 2 arrays and store the sum in a third array.
Note that I am working with the double type for these arrays.
Here are the functions used in my benchmark code:
1) First, with SSE vectorization:
#ifdef SSE
#include <x86intrin.h>
#define ALIGN 16
void addition_tab(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop
    for (i=size-1; i>=0; i-=2)
    {
        // Intrinsic SSE syntax
        const __m128d x = _mm_load_pd(a); // Load two x elements
        const __m128d y = _mm_load_pd(b); // Load two y elements
        const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
        _mm_store_pd(c, sum); // Store two sum elements
        // Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
        a += 2;
        b += 2;
        c += 2;
    }
}
#endif
2) Second, with AVX256 vectorization:
#ifdef AVX256
#include <immintrin.h>
#define ALIGN 32
void addition_tab(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop
    for (i=size-1; i>=0; i-=4)
    {
        // Intrinsic AVX syntax
        const __m256d x = _mm256_load_pd(a); // Load four x elements
        const __m256d y = _mm256_load_pd(b); // Load four y elements
        const __m256d sum = _mm256_add_pd(x, y); // Compute four sum elements
        _mm256_store_pd(c, sum); // Store four sum elements
        // Increment pointers by 4 since AVX256 vectorizes on 256 bits = 32 bytes = 4*sizeof(double)
        a += 4;
        b += 4;
        c += 4;
    }
}
#endif
For SSE vectorization, I expect a speedup of around 2 because I align data on 128 bits = 16 bytes = 2*sizeof(double).
The results I get for SSE vectorization are shown in the following figure:
So I think these results are valid, because the speedup is around a factor of 2.
Now for AVX256, I get the following figure:
For AVX256 vectorization, I expect a speedup of around 4 because I align data on 256 bits = 32 bytes = 4*sizeof(double).
But as you can see, I still get a factor of 2 and not 4 for the speedup.
I don't understand why I get the same speedup results with SSE and AVX vectorization.
Does it come from the compilation flags, from my processor model, ...? I don't know.
Here are the compilation command lines that I used for all the above results:
For SSE :
gcc-mp-4.9 -DSSE -O3 -msse main_benchmark.c -o vectorizedExe
For AVX256 :
gcc-mp-4.9 -DAVX256 -O3 -Wa,-q -mavx main_benchmark.c -o vectorizedExe
Moreover, with my processor model, could I use AVX512 vectorization? (Once the issue in this question is solved.)
Thanks for your help
UPDATE 1
I tried the different options from @Mischa but still can't get a factor-4 speedup with the AVX flags and options. You can take a look at my C source at http://example.com/test_vectorization/main_benchmark.c.txt (with .txt extension for direct view in a browser), and the shell script for benchmarking is http://example.com/test_vectorization/run_benchmark .
As @Mischa said, I tried to apply the following command line for compilation:
$GCC -O3 -Wa,-q -mavx -fprefetch-loop-arrays main_benchmark.c -o vectorizedExe
but the generated code has no AVX instructions.
If you could take a look at these files, that would be great. Thanks.
You are hitting the wall for cache-to-RAM transfer. Your Core i7 has a 64-byte cache line. For sse2, a 16-byte store requires a 64-byte load, update, and queue back to RAM. 16-byte loads in ascending order benefit from automatic prefetch prediction, so you get some load benefit. Add _mm_prefetch of the destination memory; say, 256 bytes ahead of the next store. The same applies to avx2 32-byte stores.
NP. There are options:
(1) x86-specific code:
#include <emmintrin.h>
...
for (int i=size; ...) {
_mm_prefetch(256+(char*)c, _MM_HINT_T0);
...
_mm256_store_pd(c, sum);
(2) gcc-specific code:
for (int i=size; ...) {
__builtin_prefetch(c+32);
...
(3) gcc -fprefetch-loop-arrays --- the compiler knows best.
(3) is the best if your version of gcc supports it.
(2) is next-best, if you compile and run on same hardware.
(1) is portable to other compilers.
"256", unfortunately, is a guestimate, and hardware-dependent. 128 is a minimum, 512 a maximum, depending on your CPU:RAM speed. If you switch to _mm512*(), double those numbers.
If you are working across a range of processors, may I suggest compiling in a way that covers all cases, then testing cpuid(ax=0)>=7, then cpuid(ax=7,cx=0):bx & 0x04000020 at runtime (0x20 for AVX2, 0x04000000 for AVX512 prefetch).
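If you'd rather not decode the cpuid bits by hand, GCC also has builtins for this. A sketch (the "avx512f" feature string needs a reasonably recent GCC):
#include <stdio.h>

int main(void) {
    __builtin_cpu_init();                      // populate GCC's CPU-feature cache
    if (__builtin_cpu_supports("avx512f"))
        puts("use the AVX-512 path");
    else if (__builtin_cpu_supports("avx2"))
        puts("use the AVX2 path");
    else
        puts("use the SSE2 path");
    return 0;
}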
BTW if you are using gcc and specifying -mavx or -msse2, the compiler defines builtin macros __AVX__ or __SSE2__ for you; no need for -DAVX256. To support archaic 32-bit processors, -m32 unfortunately disables __SSE2__ hence effectively disables #include <emmintrin.h> :-P
HTH

Intrinsics for 128-bit multiplication and division

In x86_64 I know that the mul and div opcodes support 128-bit integers by putting the lower 64 bits in the rax register and the upper 64 bits in the rdx register. I was looking for some sort of intrinsic to do this in the Intel Intrinsics Guide and I could not find one. I am writing a big-number library where the word size is 64 bits. Right now I am doing division by a single word like this:
int ubi_div_i64(ubigint_t* a, ubi_i64_t b, ubi_i64_t* rem)
{
    if(b == 0)
        return UBI_MATH_ERR;
    ubi_i64_t r = 0;
    for(size_t i = a->used; i-- > 0;)
    {
        ubi_i64_t out;
        __asm__("\t"
                "div %[d] \n\t"
                : "=a"(out), "=d"(r)
                : "a"(a->data[i]), "d"(r), [d]"r"(b)
                : "cc");
        a->data[i] = out;
        //ubi_i128_t top = (r << 64) + a->data[i];
        //r = top % b;
        //a->data[i] = top / b;
    }
    if(rem)
        *rem = r;
    return ubi_strip_leading_zeros(a);
}
It would be nice if I could use something in the x86intrinsics.h header instead of inline asm.
gcc has __int128 and unsigned __int128 types.
Arithmetic with them should use the right assembly instructions when they exist; I've used them in the past to get the upper 64 bits of a product, although I've never used them for division. If it's not using the right ones, submit a bug report / feature request as appropriate.
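For example, a sketch of the high half of a 64x64 product and of the 128-by-64 division the OP's commented-out code hints at. Note that for the division GCC typically calls a library routine such as __udivti3 rather than emitting a bare div, since div faults if the quotient overflows 64 bits:
#include <stdint.h>

// High 64 bits of a 64x64 -> 128-bit unsigned product.
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// Divide the 128-bit value (hi:lo) by d; return the low 64 bits of the quotient
// and the remainder through *rem.
uint64_t div128by64(uint64_t hi, uint64_t lo, uint64_t d, uint64_t *rem)
{
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    *rem = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}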
Last I looked into it, the intrinsics were in a state of flux. The main reason for the intrinsics in this case appears to be that MSVC in 64-bit mode does not allow inline assembly.
With MSVC (and I think ICC) you can use _umul128 for mul and _mulx_u64 for mulx. These don't work in GCC, at least not GCC 4.9 (_umul128 is much older than GCC 4.9). I don't know if GCC plans to support these, since you can get mul and mulx indirectly through __int128 (depending on your compile options) or directly through inline assembly.
__int128 works fine until you need a larger type and a 128-bit carry. Then you need adc, adcx, or adox, and these are even more of a problem with intrinsics. Intel's documentation disagrees with MSVC, and the compilers don't seem to produce adox yet with these intrinsics. See this question: _addcarry_u64 and _addcarryx_u64 with MSVC and ICC.
Inline assembly is probably the best solution with GCC (and probably even ICC).

Error with gcc inline assembly

I'm trying to learn how to write gcc inline assembly.
The following code is supposed to perform an shl instruction and return the result.
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
    int left = x;
    __asm__ ("shl %1, %0"
             :"=r"(left)
             :"i"(b), "0"(left));
    return left;
}
int main()
{
    uint64_t a = 1000000000;
    uint64_t res = rotate(a, 10);
    printf("%llu\n", res);
    return 0;
}
Compilation fails with error: impossible constraint in asm
The problem is basically with "i"(b). I've tried "o", "n", "m" among others but it still doesn't work. Either it's this error or an operand size mismatch.
What am I doing wrong?
As written, your code compiles correctly for me (I have optimization enabled). However, I believe you may find this to be a bit better:
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
    __asm__ ("shl %b[shift], %[value]"
             : [value] "+r"(x)
             : [shift] "Jc"(b)
             : "cc");
    return x;
}
int main(int argc, char *argv[])
{
    uint64_t a = 1000000000;
    uint64_t res = rotate(a, 10);
    printf("%llu\n", res);
    return 0;
}
Note that the 'J' is for 64bit. If you are using 32bit, 'I' is the correct value.
Other things of note:
You are truncating your rotate value from uint64_t to int? Are you compiling for 32bit code? I don't believe shl can do 64bit rotates when compiled as 32bit.
Allowing 'c' on the input constraint means you can use variable rotate amounts (ie not hard-coded at compile time).
Since shl modifies the flags, use "cc" to let the compiler know.
Using the [name] form makes the asm easier to read (IMO).
The %b is a modifier. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#i386Operandmodifiers
If you want to really get smart about inline asm, check out the latest gcc docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
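As an aside, if you only need a shift (which is what the posted code actually does) rather than a rotate, plain C needs no inline asm at all; the compiler typically emits a single shl (or shlx with BMI2) for this:
#include <inttypes.h>

uint64_t shift_left(uint64_t x, int b)
{
    return x << b;   // b must be in 0..63, otherwise the behavior is undefined
}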

PCLMULQDQ instruction in C inline asm

I want to use Intel's PCLMULQDQ instruction with inline assembly in my C Code for multiplying two polynomials, which are elements in GF(2^n). Compiler is GCC 4.8.1.
The polynomials are stored in arrays of uint32_t (6 fields big).
I already searched the web for how to use the PCLMULQDQ instruction or the CLMUL instruction set properly, but didn't find any good documentation.
I would really appreciate a simple example in C and asm of how to multiply two simple polynomials with the instruction. Does anybody know how to do it?
Besides, are there any prerequisites (other than a capable processor), like included libraries, compiler options, etc.?
I already found a solution. Thus for the record:
#include <stdint.h>  // uint32_t, uint64_t
#include <string.h>  // memset

void f2m_intel_mult(
    uint32_t t, // length of arrays A and B
    uint32_t *A,
    uint32_t *B,
    uint32_t *C
    )
{
    memset(C, 0, 2*t*sizeof(uint32_t));
    uint32_t offset = 0;
    union{ uint64_t val; struct{uint32_t low; uint32_t high;} halfs;} prod;
    uint32_t i;
    uint32_t j;
    for(i=0; i<t; i++){
        for(j=0; j<t; j++){
            prod.halfs.low = A[i];
            prod.halfs.high = 0;
            asm ("pclmulqdq %2, %1, %0;"
                 : "+x"(prod.val)
                 : "x"(B[j]), "i"(offset)
                 );
            C[i+j] = C[i+j] ^ prod.halfs.low;
            C[i+j+1] = C[i+j+1] ^ prod.halfs.high;
        }
    }
}
I think it is possible to use 64bit registers for pclmulqdq, but I couldn't find out how to get this working with inline assembler. Does anybody know this?
Nevertheless it is also possible to do the same with intrinsics. (If you want the code just ask.)
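For the record, a sketch of the intrinsics route looks like this (it needs -mpclmul and <wmmintrin.h> or <immintrin.h>; the immediate 0x00 selects the low 64-bit half of each operand):
#include <immintrin.h>
#include <stdint.h>

// Carry-less multiply of two 64-bit polynomials a and b; the 128-bit product
// is returned through lo (bits 0..63) and hi (bits 64..127).
void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    __m128i va = _mm_set_epi64x(0, (long long)a);
    __m128i vb = _mm_set_epi64x(0, (long long)b);
    __m128i p  = _mm_clmulepi64_si128(va, vb, 0x00);          // pclmulqdq
    *lo = (uint64_t)_mm_cvtsi128_si64(p);                     // low 64 bits
    *hi = (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(p, 8));  // high 64 bits
}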
Besides it is possible to optimize the calculation further with Karatsuba, if you know the size t of the arrays.

Resources