Why does clang is unable to unroll a loop (that gcc unrolls)? - c

I am writing in C and compiling using clang. I am trying to unroll a loop. The loop is not unrolled and there is a warning.
loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
You can find the results here: https://godbolt.org/z/4flN-k
int foo(int c)
{
size_t w = 0;
size_t i = sizeof(size_t);
#pragma unroll
while(i--)
{
w = (w << 8) | c;
}
return w;
}
GCC can unroll the loop with -O3 and thus I assume that clang should also unroll it.

I do not know but it can if you use the same options:
https://godbolt.org/z/VYn0CA
The inly difference is the size of the integer

Related

How can I work around GCC optimization-miss bug 90271?

GCC versions released before May 2019 (and maybe later) fail to optimize this piece of code:
// Replace the k'th byte within an int
int replace_byte(int v1 ,char v2, size_t k)
{
memcpy( (void*) (((char*)&v1)+k) , &v2 , sizeof(v2) );
return v1;
}
as can be seen here (GodBolt): clang optimizes this code properly GCC and MSVC do not. This is GCC bug 90271, which will at some point be fixed. But - it won't be fixed for GCC versions that are out today, and I want to write this code today...
So: Is there a workaround which will make GCC generate the same code as clang for this function, or at least - code of comparable performance, keeping data in registers and not resorting to pointers and the stack?
Notes:
I marked this as C, since the code snippet is in C. I assume a workaround, if one exists, can also be implemented in C.
I'm interested in optimizing both the non-inlined function and the inlined version.
This question is related to this one, but only regards GCC and the specific approach in the piece of code here; and is in C rather than C++.
This makes the non-inlined version a little longer, but the inlined version is optimized for all three compilers:
int replace_bytes(int v1 ,char v2, size_t k)
{
return (v1 & ~(0xFF << k * 8)) | ((unsigned char)v2 << k * 8);
}
The cast of v2 to an unsigned char before the shift is necessary if char is a signed type. When that's the case, without the case v2 will be sign extended to an integer, which will cause unwanted bits to be set to 1 in the result.

vectorize sum of squared residual with gcc/clang without intrinsics

I'm trying to convince gcc (4.8.1) or clang (3.4) to vectorize the following
code on a ivy bridge processor:
#include "stdlib.h"
#include "math.h"
float sumsqr(float *v, float mean, size_t n) {
float ret = 0;
for(size_t i = 0; i < n; i++) {
ret += pow((v[i] - mean), 2);
}
return ret;
}
and compiling it without success
$ gcc -std=c99 -O3 -march=native -mtune=native -ffast-math -S foo.c
is there a way to modify the code without using instrinsics or modify gcc invocation in order to obtain vectorized code?
The pow function is very general and it may not be visible to the compiler what it does (remember that it can compute things like pow(1.8, -3.19). So it might help to use only builtin operations, and not make function calls:
for(size_t i = 0; i < n; i++)
{
float const x = v[i] - mean;
ret += x * x;
}
First, don't use pow if you don't have to, plain multiplication lets gcc vectorize. Now to explain why you are getting this behavior, notice that replacing pow with powf, gcc manages to vectorize. gcc knows that pow(x,2) is x*x, but the issue here is that pow is a function for double. So the compiler must convert the number v[i]-mean to double, compute the square as a double, add it to ret as a double, and only then convert to float. If at least ret was a double, the compiler could vectorize, but as is, all those conversions make it too complicated and not worth vectorizing.

How to overcome "existence of vector dependence" in icc

I want to vectorize following loop in C:
for(k = 0; k < SysData->numOfClaGen; k++)
A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
where, there is no alias between variables and ind0 is a constant. None of the other pointers (A or B) point to ind0 and therefore, ind0 remains constant throughout the loop.
When I compile the code with icc, it says that this loop cannot be vectorized due to possible vector dependence. Here is the message:
loop was not vectorized: existence of vector dependence.
I narrowed the problem down and found out that replacing ind0 with a constant number solves the problem. So, I assume that icc thinks A may point to ind0 and therefore, ind0 may change.
I would like to know how I can help the compiler to know that it is safe to vectorized such loop.
Thanks in advance for your help.
Add #pragma ivdep in front of the for loop, it instructs the compiler to ignore assumed vector dependencies.
#pragma ivdep
for(k = 0; k < SysData->numOfClaGen; k++)
A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
for more info about ivdep, see icc doc
Use of the restrict modifier for pointers asserts to the compiler that there is no aliasing. This keyword was introduced in C99. C++ does not support it, but many C++ compilers support __restrict as an equivalent proprietary extension. With the Intel compiler, one has to enable use of restrict by adding the command line flag -restrict (Linux) or /Qrestrict (Windows). In the following version of your code the loop is vectorized as desired when using Intel compiler version 13.1.3.198:
#include <math.h>
struct bar {
int numOfClaGen;
};
void foo (double * restrict A,
const double * restrict B,
const double * restrict x1,
const struct bar * restrict SysData,
const int ind0)
{
int k;
for (k = 0; k < SysData->numOfClaGen; k++) {
A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
}
}
Invoking the compiler as follows (on a 64-bit Windows system)
icl /c /Ox /QxHost /Qrestrict /Qvec-report2 vectorize.c
the compiler reported
vectorize.c(14): (col. 5) remark: LOOP WAS VECTORIZED.
icc was changed a year ago to set -ansi-alias as a default for linux and Mac. For Windows, this default can't be counted on, as it conflicts with Microsoft usage. This option is equivalent to gcc -fstrict-aliasing, which has been a default since gcc 3.0. I think it's much better to set this option than to set ivdep restrict or simd for such a limited issue.
Although it's not well documented, icc treats __restrict the same as gcc and doesn't require the restrict or C99 option to accept it. In principle, it should come into play only for the objects being modified (A[] in the example above).
Strangely, __restrict has a slightly different meaning for MSVC++. It permits non-vector optimizations which might otherwise be prevented by possible dependencies, but doesn't enable vectorization (but it might apply to the present case).

Better way to do predicate assignment in C?

What I'm trying to do is avoid the following:
if(*ptr > 128) {
number = 5;
}
Such code performs poorly when there's no clear pattern as to which way the branch will go. What I came up with is this:
int arr[] = { number, 5 };
int cond = *ptr > 128;
number = arr[cond];
Based on my testing, that runs more than twice as fast as doing the conditional when the input is random. What I'm wondering is if there's a more clever way to do this, perhaps using bitwise operators.
A clever compiler should definitely compile this to a conditional move with the right optimization settings; check the disassembly to be sure.
There is this branchless solution:
int mask = -(*ptr > 128);
number = (number & mask) | (5 & ~mask);
The last line can also be
number = ((mask & (number ^ 5)) ^ 5);
if you're looking to use one less operation. But, caveat emptor, the compiler won't be able to optimize either of these nearly as well. You are best leaving this particular optimization for the compiler to worry about, unless you specifically know that the compiler is unable to make the optimization (in that case, you may want to check your compiler version or flags).

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code i want to optimize is basically a simple but large arithmetic formula, it should be fairly simple to analyze the code automatically to compute the independent multiplications/additions in parallel, but i read that autovectorization only works for loops.
I've read multiple times now that access of single elements in a vector via union or some other way should be avoided at all costs, instead should be replaced by a _mm_shuffle_pd (i'm working on doubles only)...
I don't seem to figure out how I can store the content of a __m128d vector as doubles without accessing it as a union. Also, does an operation like this give any performance gain when compared to scalar code?
union {
__m128d v;
double d[2];
} vec;
union {
__m128d v;
double d[2];
} vec2;
vec.v = index1;
vec2.v = index2;
temp1 = _mm_mul_pd(temp1, _mm_set_pd(bvec[vec.d[1]], bvec[vec2[1]]));
also, the two unions look ridiculously ugly, but when using
union dvec {
__m128d v;
double d[2];
} vec;
Trying to declare the indexX as dvec, the compiler complained dvec is undeclared.
Unfortunately if you look at MSDN it says the following:
You should not access the __m128d fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.
I'm no expert in SIMD, however this tells me that what you're doing won't work as it's just not designed to.
EDIT:
I've just found this, and it says:
Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use it in other arithmetic expressions such as "+" and ">>".
It also says:
Use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures.
So maybe you can use them, but only in unions. Seems contradictory to what MSDN says, however.
EDIT2:
Here is another interesting resource that describes with examples on how to use these SIMD types
In the above link, you'll find this line:
#include <math.h>
#include <emmintrin.h>
double in1_min(__m128d x)
{
return x[0];
}
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.
_mm_cvtsd_f64 + _mm_unpackhi_pd
For doubles:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128d x = _mm_set_pd(1.5, 2.5);
/* _mm_cvtsd_f64 + _mm_unpackhi_pd */
assert(_mm_cvtsd_f64(x) == 2.5);
assert(_mm_cvtsd_f64(_mm_unpackhi_pd(x, x)) == 1.5);
}
For floats, I have posted the following examples at How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC instrinc function
_mm_cvtss_f32 + _mm_shuffle_ps
_MM_EXTRACT_FLOAT
For ints you can use _mm_extract_epi32:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128i x = _mm_set_epi32(1, 2, 3, 4);
assert(_mm_extract_epi32(x, 3) == 4);
assert(_mm_extract_epi32(x, 2) == 3);
assert(_mm_extract_epi32(x, 1) == 1);
assert(_mm_extract_epi32(x, 0) == 1);
}
GitHub upstream.
Compile and run examples with:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out
Tested on Ubuntu 19.04 amd64.
There is a double _mm_cvtsd_f64 (__m128d a) function in defined in "emmintrin.h" to access the lower double of an sse vector of two doubles.
From the Intel Intrinsics guide:
Synopsis
double _mm_cvtsd_f64 (__m128d a)
include "emmintrin.h"
Instruction: movsd
CPUID Feature Flag: SSE2
Description:
Copy the lower double-precision (64-bit) floating-point element of a to dst.
Operation
dst[63:0] := a[63:0]

Resources