Vectorization - Speedup expected for SSE, AVX and AVX2 (C)

I am doing a benchmark of vectorization on macOS with the following i7 processor:
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
My MacBook Pro is from mid-2014.
I tried different flag options for vectorization; the three that interest me are SSE, AVX and AVX2.
For my benchmark, I add corresponding elements of two arrays and store the sum in a third array.
Note that I am working with the double type for these arrays.
Here are the functions used in my benchmark code:
1) First, with SSE vectorization:
#ifdef SSE
#include <x86intrin.h>
#define ALIGN 16
void addition_tab(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop
    for (i = size - 1; i >= 0; i -= 2)
    {
        // Intrinsic SSE syntax
        const __m128d x = _mm_load_pd(a); // Load two x elements
        const __m128d y = _mm_load_pd(b); // Load two y elements
        const __m128d sum = _mm_add_pd(x, y); // Compute two sum elements
        _mm_store_pd(c, sum); // Store two sum elements
        // Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
        a += 2;
        b += 2;
        c += 2;
    }
}
#endif
2) Second, with AVX (256-bit) vectorization:
#ifdef AVX256
#include <immintrin.h>
#define ALIGN 32
void addition_tab(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop
    for (i = size - 1; i >= 0; i -= 4)
    {
        // Intrinsic AVX syntax
        const __m256d x = _mm256_load_pd(a); // Load four x elements
        const __m256d y = _mm256_load_pd(b); // Load four y elements
        const __m256d sum = _mm256_add_pd(x, y); // Compute four sum elements
        _mm256_store_pd(c, sum); // Store four sum elements
        // Increment pointers by 4 since AVX vectorizes on 256 bits = 32 bytes = 4*sizeof(double)
        a += 4;
        b += 4;
        c += 4;
    }
}
#endif
For SSE vectorization, I expect a speedup of around 2 because I align data on 128 bits = 16 bytes = 2*sizeof(double) and process two doubles per instruction.
What I get for SSE vectorization is shown in the following figure:
So I think these results are valid, because the speedup is around a factor of 2.
Now for AVX256, I get the following figure:
For AVX256 vectorization, I expect a speedup of around 4 because I align data on 256 bits = 32 bytes = 4*sizeof(double) and process four doubles per instruction.
But as you can see, I still get a factor of 2, not 4, for the speedup.
I don't understand why I get the same speedup for SSE and AVX vectorization. Does it come from the compilation flags, from my processor model, ...? I don't know.
Here are the compilation command lines that I used for all the above results:
For SSE :
gcc-mp-4.9 -DSSE -O3 -msse main_benchmark.c -o vectorizedExe
For AVX256 :
gcc-mp-4.9 -DAVX256 -O3 -Wa,-q -mavx main_benchmark.c -o vectorizedExe
Moreover, with my model of processor, could I use AVX512 vectorization? (Once the issue in this question is solved.)
Thanks for your help
UPDATE 1
I tried the different options from @Mischa but still can't get a factor-4 speedup with the AVX flags and options. You can take a look at my C source at http://example.com/test_vectorization/main_benchmark.c.txt (with a .txt extension for direct viewing in a browser); the shell script for benchmarking is http://example.com/test_vectorization/run_benchmark .
As @Mischa said, I tried the following command line for compilation:
$GCC -O3 -Wa,-q -mavx -fprefetch-loop-arrays main_benchmark.c -o vectorizedExe
but the generated code has no AVX instructions.
If you could take a look at these files, that would be great. Thanks.

You are hitting the wall for cache->RAM transfer. Your Core i7 has a 64-byte cache line. For SSE2, a 16-byte store requires a 64-byte load, update, and queue back to RAM. 16-byte loads in ascending order benefit from automatic prefetch prediction, so you get some load benefit. Add _mm_prefetch of destination memory, say, 256 bytes ahead of the next store. The same applies to AVX2 32-byte stores.
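For example, here is a minimal sketch of that suggestion applied to the question's AVX loop (the 256-byte lookahead is a guess and hardware-dependent, as discussed below; the function name is illustrative):
#include <immintrin.h>

// Sketch: the AVX add loop from the question plus a destination prefetch.
// _MM_HINT_T0 requests the line into all cache levels; tune the 256-byte
// distance for your CPU:RAM speed.
void addition_tab_prefetch(int size, double *a, double *b, double *c)
{
    int i;
    for (i = 0; i + 4 <= size; i += 4)
    {
        _mm_prefetch((const char *)(c + i) + 256, _MM_HINT_T0);
        const __m256d x = _mm256_load_pd(a + i);
        const __m256d y = _mm256_load_pd(b + i);
        _mm256_store_pd(c + i, _mm256_add_pd(x, y));
    }
}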

NP. There are options:
(1) x86-specific code:
#include <emmintrin.h>
...
for (int i=size; ...) {
_mm_prefetch(256+(char*)c, _MM_HINT_T0);
...
_mm256_store_pd(c, sum);
(2) gcc-specific code:
for (int i=size; ...) {
__builtin_prefetch(c+32);
...
(3) gcc -fprefetch-loop-arrays --- the compiler knows best.
(3) is the best if your version of gcc supports it.
(2) is next-best, if you compile and run on same hardware.
(1) is portable to other compilers.
"256", unfortunately, is a guestimate, and hardware-dependent. 128 is a minimum, 512 a maximum, depending on your CPU:RAM speed. If you switch to _mm512*(), double those numbers.
If you are working across a range of processors, may I suggest compiling in a way that covers all cases, then testing cpuid(ax=0)>=7, then cpuid(ax=7,cx=0):bx & 0x04000020 at runtime (0x20 for AVX2, 0x04000000 for AVX512PF, i.e. AVX512 incl. prefetch).
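For example, with a new-enough GCC/clang this runtime test might look like the following sketch (assuming <cpuid.h> provides __get_cpuid_count; the helper name is mine):
#include <cpuid.h>

// Sketch: CPUID leaf 7, subleaf 0; EBX bit 5 = AVX2, bit 26 = AVX512PF.
static int cpu_has_avx2(void)
{
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;              // CPUID leaf 7 not supported
    return (ebx >> 5) & 1;     // AVX2 feature bit
}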
BTW, if you are using gcc and specifying -mavx or -msse2, the compiler defines the builtin macros __AVX__ or __SSE2__ for you; no need for -DAVX256. To support archaic 32-bit processors, -m32 unfortunately disables __SSE2__, hence effectively disables #include <emmintrin.h> :-P
HTH

Related

How to use AVX instructions to optimize ReLU written in C

I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i<batch_size*output_dim; ++i) {
        Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, if you have MKL then use the following; otherwise use plain g++ -O3.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop, as I found using the option -fopt-info-vec-all for compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if, e.g., Amem == Z_curr+1 -- but Amem == Z_curr should (in this case) be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and having a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        // mask of lanes where z < 0
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        // negative lanes get z*0.1, the rest keep z
        __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could do the remaining elements with a masked or overlapping store (or maybe you have use-cases where the size is guaranteed to be a multiple of 8 and you can leave it out).
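For reference, a rough sketch of that masked-store idea for the tail might look like this (assumes AVX-512F; the helper name and mask construction are mine, untested):
#include <immintrin.h>

// Sketch: handle the final bo % 8 elements with masked load/store
// instead of a scalar loop. i is the index where the main loop stopped.
static void relu_tail(double* __restrict Amem, double* Z_curr, int i, int bo)
{
    __mmask8 tail = (__mmask8)((1u << (bo - i)) - 1u);   // low (bo-i) bits set
    __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i); // masked-off lanes read as 0
    __mmask8 neg = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
    __m512d res = _mm512_mask_mul_pd(z, neg, z, _mm512_set1_pd(0.1));
    _mm512_mask_storeu_pd(Amem + i, tail, res);          // only tail lanes written
}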

256-bit vectorization via OpenMP SIMD prevents compiler's optimization (say function inlining)?

Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sum. sum_0 only computes sum of the 1st column, while sum_1 does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I can write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.
/* "test.c" */
#include <stdlib.h>
// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
    if (n == 0) return;
    size_t i;
    double *a = A, *b = A + n;
    double c0 = 0.0, c1 = 0.0;
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
    for (i = 0; i < n; i++) {
        c0 += a[i];
        if (j > 0) c1 += b[i];
    }
    c[0] = c0;
    if (j > 0) c[1] = c1;
}
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
    sum_template(j, n, A, c); \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
If I compile it with
gcc -O2 -mavx test.c
GCC (say the latest 8.2), after inlining, constant propagation and dead code elimination, would optimize out code involving c1 for function sum_0 (Check it on Godbolt).
I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying and pasting a big proportion of the code and manually defining different function versions.
However, such convenience is lost if I activate OpenMP 4.0+ with
gcc -O2 -mavx -fopenmp test.c
sum_template is no longer inlined and no dead code elimination is applied (Check it on Godbolt). But if I remove the flag -mavx to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on x86-64 (Sandy Bridge).
Remark
Using GCC's auto-vectorization -ftree-vectorize -ffast-math would not have this issue (Check it on Godbolt). But I wish to use OpenMP because it allows portable alignment pragma across different compilers.
Background
I write modules for an R package, which needs to be portable across platforms and compilers. Writing an R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform and configures a set of default compilation flags. R does not have an auto-vectorization flag, but it does have an OpenMP flag. This means that using OpenMP SIMD is the ideal way to use SIMD in an R package. See 1 and 2 for a bit more elaboration.
The simplest way to solve this problem is with __attribute__((always_inline)), or other compiler-specific overrides.
#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline inline
#else
#define ALWAYS_INLINE inline // cross your fingers
#endif
ALWAYS_INLINE
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
...
}
Godbolt proof that it works.
Also, don't forget to use -mtune=haswell, not just -mavx. It's usually a good idea. (However, promising aligned data will stop gcc's default -mavx256-split-unaligned-load tuning from splitting 256-bit loads into 128-bit vmovupd + vinsertf128, so code gen for this function is fine with tune=haswell. But normally you want this for gcc to auto-vectorize any other functions.)
You don't really need static along with inline; if a compiler decides not to inline it, it can at least share the same definition across compilation units.
Normally gcc decides to inline or not according to function-size heuristics. But even setting -finline-limit=90000 doesn't get gcc to inline with your #pragma omp (How do I force gcc to inline a function?). I had been guessing that gcc didn't realize that constant-propagation after inlining would simplify the conditional, but 90000 "pseudo-instructions" seems plenty big. There could be other heuristics.
Possibly OpenMP sets some per-function stuff differently in ways that could break the optimizer if it let them inline into other functions. Using __attribute__((target("avx"))) stops that function from inlining into functions compiled without AVX (so you can do runtime dispatching safely, without inlining "infecting" other functions with AVX instructions across if(avx) conditions.)
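A sketch of that dispatch pattern (function names are mine; __builtin_cpu_supports is a GCC builtin):
#include <stddef.h>

// Sketch: target("avx") keeps sum_avx from inlining into the generic
// caller, so AVX instructions can't leak past the runtime check.
static void sum_generic(size_t n, const double *A, double *c)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += A[i];
    c[0] = s;
}

__attribute__((target("avx")))
static void sum_avx(size_t n, const double *A, double *c)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += A[i];  // gcc may vectorize with AVX here
    c[0] = s;
}

void sum_dispatch(size_t n, const double *A, double *c)
{
    if (__builtin_cpu_supports("avx"))
        sum_avx(n, A, c);
    else
        sum_generic(n, A, c);
}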
One thing OpenMP does that you don't get with regular auto-vectorization is that reductions can be vectorized without enabling -ffast-math.
Unfortunately OpenMP still doesn't bother to unroll with multiple accumulators or anything to hide FP latency. #pragma omp is a pretty good hint that a loop is actually hot and worth spending code-size on, so gcc should really do that, even without -fprofile-use.
So especially if this ever runs on data that's hot in L2 or L1 cache (or maybe L3), you should do something to get better throughput.
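For example, a minimal scalar sketch of the multiple-accumulator idea (a real version would do the same with vector accumulators):
#include <stddef.h>

// Sketch: four partial sums let independent adds overlap in flight,
// hiding FP-add latency instead of serializing on one accumulator.
double sum_unrolled(size_t n, const double *a)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    double s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) s += a[i];   // cleanup for n % 4
    return s;
}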
And BTW, alignment isn't usually a huge deal for AVX on Haswell. But 64-byte alignment does matter a lot more in practice for AVX512 on SKX. Like maybe 20% slowdown for misaligned data, instead of a couple %.
(But promising alignment at compile time is a separate issue from actually having your data aligned at runtime. Both are helpful, but promising alignment at compile time makes tighter code with gcc7 and earlier, or on any compiler without AVX.)
I desperately needed to resolve this issue, because in my real C project, if no template trick were used for auto-generating different function versions (simply called "versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.
I was able to find a way out, and am now posting a solution using the toy example in the question.
I planned to use the inline function sum_template for versioning. If successful, the versioning happens at compile time, when the compiler performs optimization. However, the OpenMP pragma turns out to defeat this compile-time versioning. The option is then to do versioning at the preprocessing stage, using macros only.
To get rid of the inline function sum_template, I manually inline it in the macro macro_define_sum:
#include <stdlib.h>
// j can be 0 or 1
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
    if (n == 0) return; \
    size_t i; \
    double *a = A, *b = A + n; \
    double c0 = 0.0, c1 = 0.0; \
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) \
    for (i = 0; i < n; i++) { \
        c0 += a[i]; \
        if (j > 0) c1 += b[i]; \
    } \
    c[0] = c0; \
    if (j > 0) c[1] = c1; \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
In this macro-only version, j is directly substituted by 0 or 1 during macro expansion, whereas in the inline-function + macro approach in the question, I only have sum_template(0, n, A, c) or sum_template(1, n, A, c) at the preprocessing stage, and j in the body of sum_template is only propagated later, at compile time.
Unfortunately, the above macro gives an error: I cannot define or test a macro inside another (see 1, 2, 3). The OpenMP pragma starting with # is what causes the problem here. So I have to split this template into two parts: the part before the pragma and the part after.
#include <stdlib.h>
#define macro_before_pragma \
    if (n == 0) return; \
    size_t i; \
    double *a = A, *b = A + n; \
    double c0 = 0.0, c1 = 0.0;
#define macro_after_pragma(j) \
    for (i = 0; i < n; i++) { \
        c0 += a[i]; \
        if (j > 0) c1 += b[i]; \
    } \
    c[0] = c0; \
    if (j > 0) c[1] = c1;
void sum_0 (size_t n, double *A, double *c) {
    macro_before_pragma
    #pragma omp simd reduction (+: c0) aligned (a: 32)
    macro_after_pragma(0)
}
void sum_1 (size_t n, double *A, double *c) {
    macro_before_pragma
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
    macro_after_pragma(1)
}
I no longer need macro_define_sum. I can define sum_0 and sum_1 straight away using the two macros defined above. I can also adjust the pragma appropriately. Instead of a template function, I now have templates for code blocks of a function and can reuse them with ease.
The compiler output is as expected in this case (Check it on Godbolt).
Update
Thanks for the various feedback; they are all very constructive (this is why I love Stack Overflow).
Thanks Marc Glisse for pointing me to Using an openmp pragma inside #define. Yeah, it was my bad not to have searched for this issue. #pragma is a directive, not a real macro, so there has to be some way to put it inside a macro. Here is the neat version using the _Pragma operator:
/* "neat.c" */
#include <stdlib.h>
// stringizing: https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
#define str(s) #s
// j can be 0 or 1
#define macro_define_sum(j, alignment) \
void sum_ ## j (size_t n, double *A, double *c) { \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0; \
_Pragma(str(omp simd reduction (+: c0, c1) aligned (a, b: alignment))) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1; \
}
macro_define_sum(0, 32)
macro_define_sum(1, 32)
Other changes include:
I used token concatenation to generate the function name;
alignment is made a macro argument. For AVX, a value of 32 means good alignment, while a value of 8 (sizeof(double)) essentially implies no alignment. Stringizing is required to parse those tokens into the string that _Pragma requires.
Use gcc -E neat.c to inspect pre-processing result. Compilation gives desired assembly output (Check it on Godbolt).
A few comments on Peter Cordes's informative answer
Using compiler function attributes. I am not a professional C programmer. My experience with C comes merely from writing R extensions. The development environment means that I am not very familiar with compiler attributes. I know some, but don't really use them.
-mavx256-split-unaligned-load is not an issue in my application, because I will allocate aligned memory and apply padding to ensure alignment. I just need to promise the compiler the alignment so that it can generate aligned load/store instructions. I do need to do some vectorization on unaligned data, but that contributes a very limited part of the whole computation. Even if I get a performance penalty on split unaligned loads, it won't be noticed in reality. I also don't compile every C file with auto-vectorization. I only do SIMD when the operation is hot in L1 cache (i.e., it is CPU-bound, not memory-bound). By the way, -mavx256-split-unaligned-load is for GCC; what is the equivalent for other compilers?
I am aware of the difference between static inline and inline. If an inline function is only accessed by one file, I will declare it as static so that compiler does not generate a copy of it.
OpenMP SIMD can do reduction efficiently even without GCC's -ffast-math. However, it does not use horizontal addition to aggregate results inside the accumulator register at the end of the reduction; it runs a scalar loop to add up each element (see code blocks .L5 and .L27 in the Godbolt output).
Throughput is a good point (especially for floating-point arithmetic, which has relatively high latency but high throughput). My real C code where SIMD is applied is a triple loop nest. I unroll the outer two loops to enlarge the code block in the innermost loop to enhance throughput. Vectorization of the innermost one is then sufficient. With the toy example in this Q & A, where I just sum an array, I can use -funroll-loops to ask GCC for loop unrolling, using several accumulators to enhance throughput.
On this Q & A
I think most people would treat this Q & A in a more technical way than me. They might be interested in using compiler attributes or tweaking compiler flags / parameters to force function inlining. Therefore, Peter's answer, as well as Marc's comment under the answer, is still very valuable. Thanks again.

Vector Sum using AVX Inline Assembly on XeonPhi

I am new to using the Intel Xeon Phi co-processor. I want to write code for a simple vector sum using AVX 512-bit instructions. I use k1om-mpss-linux-gcc as a compiler and want to write inline assembly. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <assert.h>
#include <stdint.h>
void* aligned_malloc(size_t size, size_t alignment) {
    uintptr_t r = (uintptr_t)malloc(size + --alignment + sizeof(uintptr_t));
    uintptr_t t = r + sizeof(uintptr_t);
    uintptr_t o = (t + alignment) & ~(uintptr_t)alignment;
    if (!r) return NULL;
    ((uintptr_t*)o)[-1] = r;
    return (void*)o;
}
int main(int argc, char* argv[])
{
    printf("Starting calculation...\n");
    int i;
    const int length = 65536;
    unsigned *A = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *B = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    unsigned *C = (unsigned*) aligned_malloc(length * sizeof(unsigned), 64);
    for(i=0; i<length; i++){
        A[i] = 1;
        B[i] = 2;
    }
    const int AVXLength = length / 16;
    unsigned char * pA = (unsigned char *) A;
    unsigned char * pB = (unsigned char *) B;
    unsigned char * pC = (unsigned char *) C;
    for(i=0; i<AVXLength; i++ ){
        __asm__("vmovdqa32 %1,%%zmm0\n"
                "vmovdqa32 %2,%%zmm1\n"
                "vpaddd %0,%%zmm0,%%zmm1;"
                : "=m" (pC) : "m" (pA), "m" (pB));
        pA += 64;
        pB += 64;
        pC += 64;
    }
    // To prove that the program actually worked
    for (i=0; i <5 ; i++)
    {
        printf("C[%d] = %f\n", i, C[i]);
    }
}
However, when I run the program, I get a segmentation fault from the asm part. Can somebody help me with that?
Thanks
Xeon Phi Knights Corner doesn't support AVX. It only supports a special set of vector extensions, called Intel Initial Many Core Instructions (Intel IMCI), with a vector size of 512 bits. So trying to put any sort of AVX-specific assembly into KNC code will lead to crashes.
Just wait for Knights Landing. It will support AVX-512 vector extensions.
Although Knights Corner (KNC) does not have AVX512, it has something very similar. Many of the mnemonics are the same. In fact, in the OP's case, the mnemonics vmovdqa32 and vpaddd are the same for AVX512 and KNC.
The opcodes likely differ, but the compiler/assembler takes care of this. In the OP's case he/she is using a special version of GCC, k1om-mpss-linux-gcc, which is part of the KNC many-core software stack and presumably generates the correct opcodes. One can compile on the host using k1om-mpss-linux-gcc and then scp the binary to the KNC card. I learned about this from a comment in this question.
As to why the OP's code is failing, I can only guess, since I don't have a KNC card to test with.
In my limited experience with GCC inline assembly I have learned that it's good to look at the generated assembly in the object file to make sure the compiler did what you expect.
When I compile your code with a normal version of GCC I see that the line "vpaddd %0,%%zmm0,%%zmm1;" produces assembly with the semicolon. I don't think the semicolon should be there. That could be one problem.
But since the OP's mnemonics are the same as AVX512, we can use AVX512 intrinsics to figure out the correct assembly:
#include <x86intrin.h>
void foo(int *A, int *B, int *C) {
    __m512i a16 = _mm512_load_epi32(A);
    __m512i b16 = _mm512_load_epi32(B);
    __m512i s16 = _mm512_add_epi32(a16,b16);
    _mm512_store_epi32(C, s16);
}
and gcc -mavx512f -O3 -S knc.c produces
vmovdqa64 (%rsi), %zmm0
vpaddd (%rdi), %zmm0, %zmm0
vmovdqa64 %zmm0, (%rdx)
GCC chose vmovdqa64 instead of vmovdqa32, even though the Intel documentation says it should be vmovdqa32. I am not sure why. I don't know what the difference is. I could have used the intrinsic _mm512_load_si512, which does exist and according to Intel should map to vmovdqa32, but GCC maps it to vmovdqa64 as well. I am not sure why there are also _mm512_load_epi32 and _mm512_load_epi64 now. SSE and AVX don't have these corresponding intrinsics.
Based on GCC's code here is the inline assembly I would use
__asm__ ("vmovdqa64 (%1), %%zmm0\n"
"vpaddd (%2), %%zmm0, %%zmm0\n"
"vmovdqa64 %%zmm0, (%0)"
:
: "r" (pC), "r" (pA), "r" (pB)
: "memory"
);
Maybe vmovdqa32 should be used instead of vmovdqa64 but I expect it does not matter.
I used the register modifier r instead of the memory modifier m because, from past experience, the memory modifier m did not produce the assembly I expected.
Another possibility to consider is to use a version of GCC that supports AVX512 intrinsics to generate the assembly and then use the special KNC version of GCC to convert the assembly to binary. For example
gcc-5.1 -O3 -S foo.c
k1om-mpss-linux-gcc foo.s
This may be asking for trouble since k1om-mpss-linux-gcc is likely an older version of GCC. I have never done something like this before but it may work.
As explained here, the reason the AVX512 intrinsics
_mm512_load/store(u)_epi32
_mm512_load/store(u)_epi64
_mm512_load/store(u)_si512
exist is that the parameters have been converted to void*. For example, with SSE you have to cast:
int *x;
__m128i v;
_mm_store_si128((__m128i*)x, v);
whereas with AVX512 you no longer need to:
int *x;
__m512i v;
_mm512_store_epi32(x, v);
//_mm512_store_si512(x, v); //this is also fine
It's still not clear to me why there are vmovdqa32 and vmovdqa64 (GCC only seems to use vmovdqa64 currently), but it's probably similar to movaps and movapd in SSE, which have no real difference and exist only in case they make a difference in the future.
The purpose of vmovdqa32 and vmovdqa64 is masking, which can be done with these intrinsics:
_mm512_mask_load/store_epi32
_mm512_mask_load/store_epi64
Without masks the instructions are equivalent.
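For example, a sketch of a masked 32-bit load (AVX512F; the mask value and function name are arbitrary, for illustration):
#include <immintrin.h>

// Sketch: under a mask the 32/64 distinction matters, because merging
// happens per 32-bit or per 64-bit element. Here only the even 32-bit
// lanes are loaded from p (which must be 64-byte aligned); the odd
// lanes keep their values from fallback.
__m512i load_even_lanes(const int *p, __m512i fallback)
{
    __mmask16 even = 0x5555;
    return _mm512_mask_load_epi32(fallback, even, p);
}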

Intrinsics for 128 multiplication and division

In x86_64, I know that the mul and div opcodes support 128-bit integers by putting the lower 64 bits in the rax register and the upper 64 bits in the rdx register. I was looking for some sort of intrinsic to do this in the Intel Intrinsics Guide, but I could not find one. I am writing a big-number library where the word size is 64 bits. Right now I am doing division by a single word like this:
int ubi_div_i64(ubigint_t* a, ubi_i64_t b, ubi_i64_t* rem)
{
    if(b == 0)
        return UBI_MATH_ERR;
    ubi_i64_t r = 0;
    for(size_t i = a->used; i-- > 0;)
    {
        ubi_i64_t out;
        __asm__("\t"
                "div %[d] \n\t"
                : "=a"(out), "=d"(r)
                : "a"(a->data[i]), "d"(r), [d]"r"(b)
                : "cc");
        a->data[i] = out;
        //ubi_i128_t top = (r << 64) + a->data[i];
        //r = top % b;
        //a->data[i] = top / b;
    }
    if(rem)
        *rem = r;
    return ubi_strip_leading_zeros(a);
}
It would be nice if I could use something from the x86intrin.h header instead of inline asm.
gcc has __int128 and unsigned __int128 types.
Arithmetic with them should use the right assembly instructions when they exist; I've used them in the past to get the upper 64 bits of a product, although I've never used them for division. If it's not using the right ones, submit a bug report / feature request as appropriate.
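For instance, the commented-out lines in the question map onto unsigned __int128 roughly like this (a sketch; note gcc typically calls a runtime helper such as __udivti3 for a full 128/64 division, since a bare div faults when the quotient overflows 64 bits):
#include <stdint.h>

// Sketch: one step of the long-division loop using unsigned __int128
// (a gcc/clang extension); names are illustrative.
static uint64_t div_step_u128(uint64_t rem, uint64_t word, uint64_t b,
                              uint64_t *quot)
{
    unsigned __int128 top = ((unsigned __int128)rem << 64) | word;
    *quot = (uint64_t)(top / b);
    return (uint64_t)(top % b);
}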
Last I looked into it, the intrinsics were in a state of flux. The main reason for the intrinsics in this case appears to be that MSVC in 64-bit mode does not allow inline assembly.
With MSVC (and I think ICC) you can use _umul128 for mul and _mulx_u64 for mulx. These don't work in GCC, at least not GCC 4.9 (_umul128 is much older than GCC 4.9). I don't know if GCC plans to support these, since you can get mul and mulx indirectly through __int128 (depending on your compile options) or directly through inline assembly.
__int128 works fine until you need a larger type and a 128-bit carry. Then you need adc, adcx, or adox, and these are even more of a problem with intrinsics. Intel's documentation disagrees with MSVC, and the compilers don't seem to produce adox yet with these intrinsics. See this question: _addcarry_u64 and _addcarryx_u64 with MSVC and ICC.
Inline assembly is probably the best solution with GCC (and probably even ICC).

How to combine two __m128 values to __m256?

I would like to combine two __m128 values to one __m256.
Something like this:
__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);
to something like:
__m256 c = { 1, 2, 3, 4, 5, 6, 7, 8 };
Are there any intrinsics that I can use to do this?
This should do what you want:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_castps128_ps256(a);
c = _mm256_insertf128_ps(c,b,1);
If the order is reversed from what you want, then just switch a and b.
The intrinsic of interest is _mm256_insertf128_ps which will let you insert a 128-bit register into either lower or upper half of a 256-bit AVX register:
http://software.intel.com/sites/products/documentation/studio/composer/en-us/2011/compiler_c/intref_cls/common/intref_avx_insertf128_ps.htm
The complete family of them is here:
_mm256_insertf128_pd()
_mm256_insertf128_ps()
_mm256_insertf128_si256()
Intel documents __m256 _mm256_set_m128(__m128 hi, __m128 lo) and _mm256_setr_m128(lo, hi) as intrinsics for the vinsertf128 instruction, which is what you want (see footnote 1). (Of course there are also __m256d and __m256i versions, which use the same instruction. The __m256i version may use vinserti128 if AVX2 is available; otherwise it'll use f128 as well.)
These days, those intrinsics are supported by current versions of all 4 major x86 compilers (gcc, clang, MSVC, and ICC). But not by older versions; like some other helper intrinsics that Intel documents, widespread implementation has been slow. (Often GCC or clang are the last hold-out to not have something you wish you could use portably.)
Use it if you don't need portability to old GCC versions: it's the most readable way to express what you want, following the well known _mm_set and _mm_setr patterns.
Performance-wise, it's of course just as efficient as the manual cast + vinsertf128 intrinsics (@Mysticial's answer), and for gcc at least that's literally how the internal .h actually implements _mm256_set_m128.
Compiler version support for _mm256_set_m128 / _mm256_setr_m128:
clang: 3.6 and newer. (Mainline, IDK about Apple)
GCC: 8.x and newer, not present as recently as GCC7!
ICC: since at least ICC13, the earliest on Godbolt.
MSVC: since at least 19.14 and 19.10 (WINE) VS2015, the earliest on Godbolt.
https://godbolt.org/z/1na1qr has test cases for all 4 compilers.
__m256 combine_testcase(__m128 hi, __m128 lo) {
    return _mm256_set_m128(hi, lo);
}
They all compile this function to one vinsertf128, except MSVC where even the latest version wastes a vmovups xmm2, xmm1 copying a register. (I used -O2 -Gv -arch:AVX to use the vectorcall convention so args would be in registers to make an efficient non-inlined function definition possible for MSVC.) Presumably MSVC would be ok inlining into a larger function if it could write the result to a 3rd register, instead of the calling convention forcing it to read xmm0 and write ymm0.
Footnote 1:
vinsertf128 is very efficient on Zen1, and as efficient as vperm2f128 on other CPUs with 256-bit-wide shuffle units. It can also take the high half from memory in case the compiler spilled it or is folding a _mm_loadu_ps into it, instead of needing to separately do a 128-bit load into a register; vperm2f128's memory operand would be a 256-bit load which you don't want.
https://uops.info/ / https://agner.org/optimize/
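For example, a sketch where the high half comes straight from memory, so the compiler can fold the load into vinsertf128's memory operand (function name is mine):
#include <immintrin.h>

// Sketch: _mm_loadu_ps(hi) can fold into vinsertf128 as a 128-bit
// memory operand instead of needing its own vmovups.
__m256 combine_from_mem(__m128 lo, const float *hi)
{
    return _mm256_insertf128_ps(_mm256_castps128_ps256(lo),
                                _mm_loadu_ps(hi), 1);
}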
Even this one will work:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_insertf128_ps(c,a,0);
c = _mm256_insertf128_ps(c,b,1);
You will get a warning, as c is not initialized, but you can ignore it; if you're looking for performance, this solution will use fewer clock cycles than the other one.
You can also use the permute intrinsic:
__m128 a = _mm_set_ps(1,2,3,4);
__m128 b = _mm_set_ps(5,6,7,8);
__m256 c = _mm256_permute2f128_ps(_mm256_castps128_ps256(a), _mm256_castps128_ps256(b), 0x20);
I don't know which way is faster.
I believe this is the simplest:
#define _mm256_set_m128(/* __m128 */ hi, /* __m128 */ lo) \
    _mm256_insertf128_ps(_mm256_castps128_ps256(lo), (hi), 0x1)
__m256 c = _mm256_set_m128(a, b);
Do note that _mm256_set_m128 is already defined in MSVC 2019 if you #include "immintrin.h".
