How to overcome "existence of vector dependence" in icc (C)

I want to vectorize the following loop in C:
for (k = 0; k < SysData->numOfClaGen; k++)
    A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
where there is no aliasing between the variables and ind0 is a constant. None of the other pointers (A or B) points to ind0, so ind0 remains constant throughout the loop.
When I compile the code with icc, it says that this loop cannot be vectorized due to possible vector dependence. Here is the message:
loop was not vectorized: existence of vector dependence.
I narrowed the problem down and found that replacing ind0 with a literal constant makes the message go away. So I assume icc thinks A may point to ind0 and that ind0 may therefore change.
I would like to know how I can tell the compiler that it is safe to vectorize this loop.
Thanks in advance for your help.

Add #pragma ivdep in front of the for loop; it instructs the compiler to ignore assumed vector dependencies:
#pragma ivdep
for (k = 0; k < SysData->numOfClaGen; k++)
    A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
For more info about ivdep, see the icc documentation.

Use of the restrict qualifier on pointers asserts to the compiler that there is no aliasing. This keyword was introduced in C99. C++ does not support it, but many C++ compilers support __restrict as an equivalent proprietary extension. With the Intel compiler, one has to enable use of restrict by adding the command line flag -restrict (Linux) or /Qrestrict (Windows). In the following version of your code, the loop is vectorized as desired when using Intel compiler version 13.1.3.198:
#include <math.h>

struct bar {
    int numOfClaGen;
};

void foo (double * restrict A,
          const double * restrict B,
          const double * restrict x1,
          const struct bar * restrict SysData,
          const int ind0)
{
    int k;
    for (k = 0; k < SysData->numOfClaGen; k++) {
        A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
    }
}
Invoking the compiler as follows (on a 64-bit Windows system)
icl /c /Ox /QxHost /Qrestrict /Qvec-report2 vectorize.c
the compiler reported
vectorize.c(14): (col. 5) remark: LOOP WAS VECTORIZED.

icc was changed a year ago to set -ansi-alias as the default for Linux and Mac. On Windows this default can't be counted on, as it conflicts with Microsoft usage. The option is equivalent to gcc's -fstrict-aliasing, which has been the default since gcc 3.0. I think it's much better to set this option than to resort to ivdep, restrict, or simd for such a limited issue.
Although it's not well documented, icc treats __restrict the same way gcc does and doesn't require the -restrict or C99 option to accept it. In principle, it only needs to be applied to the objects being modified (A[] in the example above).
Strangely, __restrict has a slightly different meaning for MSVC++: it permits non-vector optimizations which might otherwise be prevented by possible dependencies, but doesn't enable vectorization (though it might still apply to the present case).
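For illustration, a minimal sketch combining those two points, assuming icc on Linux (Windows spells the flag /Qansi-alias):
/* Compile with strict aliasing enabled explicitly:
 *     icc -c -O3 -ansi-alias -vec-report2 vectorize.c
 *
 * And/or qualify only the pointer that is written through; icc
 * accepts __restrict without -restrict or -std=c99.
 */
void foo(double *__restrict A, const double *B, const double *x1,
         const struct bar *SysData, const int ind0);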

Related

Why can GCC only do loop interchange optimization when the int size is a compile-time constant?

When I compile this snippet (with -Ofast -floop-nest-optimize), gcc generates assembly which traverses the array in source order.
However, if I uncomment the line // n = 32767 and assign any number to n, it interchanges the index order to x[i * n + j]. Traversing memory in contiguous row-major order is much more cache-friendly than striding down columns.
float matrix_sum_column_major(float* x, int n) {
    // n = 32767;
    float sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += x[j * n + i];
    return sum;
}
On godbolt
Why can't GCC or clang do loop interchange with a runtime-variable int size? Real-world code won't usually have the size declared explicitly.
PS: I've tried this with different versions of gcc and with clang-9, and it seems to happen with both.
PS2: Even if I make x a local variable malloced inside the function, it still happens.
Compilers generally focus their efforts (and should focus their efforts) on places where constructs likely to be used by programmers interested in efficiency can be replaced with other constructs that are easily proven equivalent in all cases that matter. If n is a constant, a compiler can determine the exact set of array indices the loop will use and then figure out how to process all of those indices. If n isn't constant, a compiler might be able to determine that, when n is positive, the code uses all indices from 0 to n*n-1, but that would likely require a lot more effort. The authors of clang and gcc might have been able to make such a determination in this case if they had tried hard enough, but they likely judged the effort not worthwhile.
Note that if code will use a few particular values of n far more often than any others, it can pay to check explicitly for those values and use loops tailored to them, as shown in the sketch below: a compiler may be able to generate far more efficient code for those loops than for a loop that must handle an arbitrary n. Because many real-world problems have some values of n that get used much more than others, it would not be unreasonable for a compiler writer to assume that performance-minded programmers will write such special-purpose loops themselves, and that effort spent improving the arbitrary-n loop may pay off less than the same effort spent elsewhere.
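A minimal sketch of that special-casing idea, assuming 32767 happens to be a hot size for the caller (both function names here are hypothetical):
#include <stddef.h>

/* Same loop with the trip count baked in, so the compiler sees a
 * constant n and (with -Ofast -floop-nest-optimize) can interchange
 * the loops. */
static float matrix_sum_fixed_32767(const float *x) {
    enum { N = 32767 };
    float sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += x[j * (size_t)N + i];
    return sum;
}

/* Dispatch: special-case the assumed hot size, fall back to the
 * generic loop for every other n. */
float matrix_sum_column_major_dispatch(const float *x, int n) {
    if (n == 32767)
        return matrix_sum_fixed_32767(x);
    float sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            sum += x[j * (size_t)n + i];
    return sum;
}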

Why is clang unable to unroll a loop (that gcc unrolls)?

I am writing in C and compiling with clang. I am trying to unroll a loop, but the loop is not unrolled and there is a warning:
loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
You can find the results here: https://godbolt.org/z/4flN-k
#include <stddef.h>

int foo(int c)
{
    size_t w = 0;
    size_t i = sizeof(size_t);
    #pragma unroll
    while (i--)
    {
        w = (w << 8) | c;
    }
    return w;
}
GCC can unroll the loop with -O3 and thus I assume that clang should also unroll it.
I do not know why the pragma fails, but clang can unroll it if you use the same options as gcc:
https://godbolt.org/z/VYn0CA
The only difference is the size of the integer.
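For example, assuming the snippet is saved as foo.c, giving clang the same optimization level gcc had:
clang -O3 -S foo.c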

How can I work around GCC optimization-miss bug 90271?

GCC versions released before May 2019 (and maybe later) fail to optimize this piece of code:
#include <string.h>

// Replace the k'th byte within an int
int replace_byte(int v1, char v2, size_t k)
{
    memcpy((void*) (((char*)&v1) + k), &v2, sizeof(v2));
    return v1;
}
as can be seen here (GodBolt): clang optimizes this code properly; GCC and MSVC do not. This is GCC bug 90271, which will at some point be fixed. But it won't be fixed for GCC versions that are out today, and I want to write this code today...
So: Is there a workaround which will make GCC generate the same code as clang for this function, or at least - code of comparable performance, keeping data in registers and not resorting to pointers and the stack?
Notes:
I marked this as C, since the code snippet is in C. I assume a workaround, if one exists, can also be implemented in C.
I'm interested in optimizing both the non-inlined function and the inlined version.
This question is related to this one, but only regards GCC and the specific approach in the piece of code here; and is in C rather than C++.
This makes the non-inlined version a little longer, but the inlined version is optimized for all three compilers:
int replace_bytes(int v1, char v2, size_t k)
{
    // Unsigned constants/casts keep the shift into the top byte
    // (k == 3) from being signed-overflow undefined behavior.
    return (v1 & ~(0xFFu << k * 8)) | ((unsigned)(unsigned char)v2 << k * 8);
}
The cast of v2 to an unsigned char before the shift is necessary if char is a signed type. In that case, without the cast, v2 would be sign-extended to an int, which would set unwanted bits to 1 in the result.
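As a quick sanity check (a hypothetical test; on a little-endian target it matches the memcpy version, whose byte k is the k'th least-significant byte):
#include <assert.h>
#include <stddef.h>

int replace_bytes(int v1, char v2, size_t k);  /* as defined above */

int main(void) {
    /* Replace byte 1 of 0x11223344 with 0xAB. */
    assert(replace_bytes(0x11223344, (char)0xAB, 1) == 0x1122AB44);
    return 0;
}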

Can I make #pragma unroll accept macros/expressions rather than plain numbers?

I am trying to tell my compiler to unroll a loop for me using #pragma unroll. However, the number of iterations is determined by a compile-time variable, so the loop needs to be unrolled that many times. Like this:
#define ITEMS 4
#pragma unroll (ITEMS + 1)
for (unsigned int ii = 0; ii <= ITEMS; ++ii)
    /* do something */;
The compiler doesn't like this, though, as it gives me the following warning: warning: extra characters in the unroll pragma (expected a single positive integer), ignoring pragma for this loop. I understand what this means, of course: it wants a single integer rather than an expression. Is there a way to do this, though, without changing the unroll parameter every time I change ITEMS?
The compiler I am using is CUDA's NVCC compiler.
You could do it the other way around (note: I just noticed Daniel Fischer's comment, which suggests exactly the same thing):
#define ITEMS_PLUS_ONE 5
#define ITEMS (ITEMS_PLUS_ONE - 1)
The issue is that the preprocessor doesn't do math; it only does copy-and-paste.
When you write #define ITEMS_PLUS_ONE (ITEMS + 1), the pragma's argument is replaced with (4 + 1), not with 5.
Once this reaches the compiler proper, it doesn't matter: even without optimization, the calculation is done during compilation, and (4 + 1) is exactly the same as 5.
But #pragma unroll is processed before compilation, and it wants a plain number.
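Putting it together with the loop from the question (ITEMS_PLUS_ONE is the helper macro suggested above):
#define ITEMS_PLUS_ONE 5
#define ITEMS (ITEMS_PLUS_ONE - 1)

/* After preprocessing, the pragma sees the literal 5. */
#pragma unroll (ITEMS_PLUS_ONE)
for (unsigned int ii = 0; ii <= ITEMS; ++ii)
    /* do something */;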

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly simple to analyze the formula automatically and compute the independent multiplications/additions in parallel, but I have read that autovectorization only works for loops.
I've read multiple times now that accessing single elements of a vector via a union (or any other way) should be avoided at all costs and replaced by _mm_shuffle_pd (I'm working with doubles only)...
I can't figure out how to store the contents of a __m128d vector as doubles without accessing it as a union. Also, does an operation like this give any performance gain compared to scalar code?
union {
    __m128d v;
    double d[2];
} vec;
union {
    __m128d v;
    double d[2];
} vec2;

vec.v = index1;
vec2.v = index2;
temp1 = _mm_mul_pd(temp1, _mm_set_pd(bvec[vec.d[1]], bvec[vec2.d[1]]));
Also, the two unions look ridiculously ugly. But when using
union dvec {
    __m128d v;
    double d[2];
} vec;
and trying to declare indexX as dvec, the compiler complained that dvec is undeclared.
Unfortunately, if you look at MSDN, it says the following:
You should not access the __m128d fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.
I'm no expert in SIMD, but this tells me that what you're doing won't work, as it's just not designed to.
EDIT:
I've just found this, and it says:
Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use it in other arithmetic expressions such as "+" and ">>".
It also says:
Use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures.
So maybe you can use them, but only in unions. Seems contradictory to what MSDN says, however.
EDIT2:
Here is another interesting resource that describes, with examples, how to use these SIMD types.
In the above link, you'll find this snippet:
#include <math.h>
#include <emmintrin.h>

double in1_min(__m128d x)
{
    return x[0];
}
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.
_mm_cvtsd_f64 + _mm_unpackhi_pd
For doubles:
#include <assert.h>
#include <x86intrin.h>

int main(void) {
    __m128d x = _mm_set_pd(1.5, 2.5);
    /* _mm_cvtsd_f64 extracts the low element; _mm_unpackhi_pd
       moves the high element into the low position. */
    assert(_mm_cvtsd_f64(x) == 2.5);
    assert(_mm_cvtsd_f64(_mm_unpackhi_pd(x, x)) == 1.5);
}
For floats, I have posted the following examples at How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC instrinc function
_mm_cvtss_f32 + _mm_shuffle_ps
_MM_EXTRACT_FLOAT
For ints you can use _mm_extract_epi32:
#include <assert.h>
#include <x86intrin.h>

int main(void) {
    __m128i x = _mm_set_epi32(1, 2, 3, 4);
    /* _mm_set_epi32 takes its arguments from the highest element
       down to the lowest, so element 3 holds 1 and element 0 holds 4. */
    assert(_mm_extract_epi32(x, 3) == 1);
    assert(_mm_extract_epi32(x, 2) == 2);
    assert(_mm_extract_epi32(x, 1) == 3);
    assert(_mm_extract_epi32(x, 0) == 4);
}
GitHub upstream.
Compile and run the examples with (_mm_extract_epi32 is an SSE4.1 intrinsic, hence the -msse4.1 flag):
gcc -ggdb3 -O0 -std=c99 -msse4.1 -Wall -Wextra -pedantic -o main.out main.c
./main.out
Tested on Ubuntu 19.04 amd64.
There is a double _mm_cvtsd_f64 (__m128d a) function defined in "emmintrin.h" to access the lower double of an SSE vector of two doubles.
From the Intel Intrinsics guide:
Synopsis
double _mm_cvtsd_f64 (__m128d a)
include "emmintrin.h"
Instruction: movsd
CPUID Feature Flag: SSE2
Description:
Copy the lower double-precision (64-bit) floating-point element of a to dst.
Operation
dst[63:0] := a[63:0]
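For completeness, a sketch of the storing direction asked about in the title; this uses _mm_storeu_pd (not mentioned in the answers above) to write both lanes to a plain double array, with no union involved:
#include <emmintrin.h>
#include <stdio.h>

int main(void) {
    __m128d v = _mm_set_pd(1.5, 2.5);   /* high = 1.5, low = 2.5 */
    double d[2];
    _mm_storeu_pd(d, v);                /* d[0] = low, d[1] = high */
    printf("%f %f\n", d[0], d[1]);      /* prints 2.500000 1.500000 */
    return 0;
}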
