Why is this brightness adjustment not getting vectorized in Clang? - c

I tried compiling this code,
#include <stdlib.h>
struct rgb {
    int r, g, b;
};
void adjust_brightness(struct rgb *picdata, size_t len, int adjustment) {
    // assume adjustment is between 0 and 255.
    for (int i = 0; i < len; i++) {
        picdata[i].r += adjustment;
        picdata[i].g += adjustment;
        picdata[i].b += adjustment;
    }
}
on OSX using this command,
$ cc -Rpass-analysis=loop-vectorize -c -std=c99 -O3 brightness.c
brightness.c:13:3: remark: loop not vectorized: unsafe dependent memory operations in loop [-Rpass-analysis=loop-vectorize]
for (int i = 0; i < len; i++) {
^
Can someone explain what is unsafe and dependent here? I'm learning about SIMD, and this was presented as the most obvious use case for it. I was hoping to learn how the compiler would generate SIMD instructions for a simple example. In my head, I expect that instead of incrementing by 1, the compiler would increment by enough elements to fill the vector registers with each pass through the loop body.
Do I misunderstand?
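For comparison, here is a flat-array version of the same operation (a hedged illustration, not taken from any answer): with a single int array and a loop index whose type matches len, both clang and gcc vectorize the loop at -O3 without extra hints, which suggests the difficulty lies in the strided struct-member accesses rather than in the arithmetic itself.

#include <stddef.h>

/* Illustrative only: one contiguous array, index type matches len. */
void adjust_brightness_flat(int *picdata, size_t len, int adjustment)
{
    for (size_t i = 0; i < len; i++)
        picdata[i] += adjustment;
}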

Related

Tell GCC to avoid unrolling the outer loop [duplicate]

I have the following 4x4 matrix-vector multiply code:
double const __restrict__ a[16];
double const __restrict__ x[4];
double __restrict__ y[4];
//#pragma GCC unroll 1 - does not work either
#pragma GCC nounroll
for ( int j = 0; j < 4; ++j )
{
    double const* __restrict__ aj = a + j * 4;
    double const xj = x[j];
    #pragma GCC ivdep
    for ( int i = 0; i < 4; ++i )
    {
        y[i] += aj[i] * xj;
    }
}
I compile with -O3 -mavx flags. The inner loop is vectorized (single FMAD). However, gcc (7.2) keeps unrolling the outer loop 4 times, unless I use -O2 or lower optimization.
Is there a way to override -O3 unrolling of a particular loop?
NB. Similar #pragma nounroll works if I use Intel icc.
According to the documentation, #pragma GCC unroll 1 is supposed to work, if you place it just so. If it doesn't then you should submit a bug report.
Alternatively, you can use a function attribute to set optimizations, I think:
void myfn () __attribute__((optimize("no-unroll-loops")));
If you need a compact function without full or partial loop unrolling, you can also try the following function attribute:
__attribute__((optimize("Os")))
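A minimal sketch of the attribute approach applied to the kernel above (assuming GCC; the function name and plain-pointer signature here are just for illustration):

// no-unroll-loops is applied to this one function only; the rest of the
// translation unit still gets the normal -O3 treatment.
__attribute__((optimize("no-unroll-loops")))
void matvec4(const double *__restrict__ a, const double *__restrict__ x,
             double *__restrict__ y)
{
    for (int j = 0; j < 4; ++j)
    {
        const double *aj = a + j * 4;
        const double xj = x[j];
        for (int i = 0; i < 4; ++i)
            y[i] += aj[i] * xj;
    }
}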

How to use AVX instructions to optimize ReLU written in C

I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i<batch_size*output_dim; ++i) {
        Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, use the following if you have MKL; otherwise plain g++ -O3 will do.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop, as I found by compiling with -fopt-info-vec-all:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler by using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if, e.g., Amem == Z_curr+1 -- but Amem == Z_curr should, in this case, be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
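For reference, a quick sketch of that simplification in plain C (using fmax from <math.h>; the name relu_max and the dropped reference parameter are just for illustration): for positive inputs fmax picks z, for negative inputs it picks 0.1*z, so it matches the original select.

#include <math.h>

void relu_max(double* __restrict Amem, const double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i)
        Amem[i] = fmax(Z_curr[i], 0.1 * Z_curr[i]);   // branch-free leaky ReLU
}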
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and having a look at the generated code. For portability, you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        // lanes where z is negative get multiplied by 0.1; the other lanes keep z
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could handle the remaining elements with a masked or overlapping store (or maybe you have use cases where the size is guaranteed to be a multiple of 8 and you can leave the cleanup loop out entirely).
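For what it's worth, here is a hedged sketch of that masked-store idea (the name relu_avx512_tail is just for illustration): the main loop is the same as above, and the last bo % 8 elements are handled with a masked load/store instead of a scalar loop.

void relu_avx512_tail(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i = 0; i <= bo - 8; i += 8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr + i);
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        _mm512_storeu_pd(Amem + i, _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1)));
    }
    if (i < bo)
    {
        // the low (bo - i) lanes are active; bo - i is between 1 and 7 here
        __mmask8 tail = (__mmask8)((1u << (bo - i)) - 1);
        __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
        _mm512_mask_storeu_pd(Amem + i, tail, res);
    }
}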

Why is the compiler not able to optimize a read from TLS?

Consider the following header file, "tls.h":
#include <stdint.h>
// calling this function is expensive
uint64_t foo(uint64_t x);
extern __thread uint64_t cache;
static inline uint64_t
get(uint64_t x)
{
    // if cache is not valid
    if (cache == UINT64_MAX)
        cache = foo(x);
    return cache + x;
}
and source file "tls.c":
#include "tls.h"
__thread uint64_t cache = {0};
uint64_t foo(uint64_t x)
{
// imagine some calculations are performed here
return 0;
}
Below is an example usage of the get() function in "main.c":
#include "tls.h"
uint64_t t = 0;
int main()
{
uint64_t x = 0;
for(uint64_t i = 0; i < 1024UL * 1024 * 1024; i++){
t += get(i);
x++;
}
}
Presented files are compiled as following:
gcc -c -O3 tls.c
gcc -c -O3 main.c
gcc -O3 main.o tls.o
Examining the performance of the loop in "main.c" revealed that the compiler optimization is very poor. After disassembling the binary, it is clear that the TLS variable is accessed in every iteration.
Execution time on my machine is 1.7s.
However, if I remove the check for cache validity in the get() function so that it looks like this:
static inline uint64_t
get(uint64_t x)
{
    return cache + x;
}
the compiler is now able to create much faster code - it completely removes the loop and generates only one "add" instruction. Execution time is ~0.02s.
Why is the compiler not able to optimize the first case? A TLS variable cannot be changed by other threads, so the compiler should be able to optimize this, right?
Is there any other way I can optimize the get() function?
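One workaround worth trying (a sketch under the assumption that nothing else reads or writes cache while the loop runs; not taken from any answer): copy the TLS slot into a local at the start, keep the validity check on the local, and write it back afterwards, so the compiler can hold the value in a register instead of re-reading TLS on each iteration.

#include "tls.h"
uint64_t t = 0;
int main()
{
    uint64_t c = cache;                 // read TLS once
    for (uint64_t i = 0; i < 1024UL * 1024 * 1024; i++) {
        if (c == UINT64_MAX)            // same validity check, on the local
            c = foo(i);
        t += c + i;
    }
    cache = c;                          // write the (possibly updated) value back
}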

gcc auto vectorization control flow in loop

In the code below, why can the second loop be auto-vectorized but not the first? How can I modify the code so that it does auto-vectorize? gcc says:
note: not vectorized: control flow in loop.
I am using gcc 8.2, flags are -O3 -fopt-info-vec-all. I am compiling for x86-64 avx2.
#include <stdlib.h>
#include <math.h>
void foo(const float * x, const float * y, const int * v, float * vec, float * novec, size_t size) {
    size_t i;
    float bar;
    for (i=0 ; i<size ; ++i){
        bar = x[i] - y[i];
        novec[i] = v[i] ? bar : NAN;
    }
    for (i=0 ; i<size ; ++i){
        bar = x[i];
        vec[i] = v[i] ? bar : NAN;
    }
}
Update:
This does autovectorize:
for (i=0 ; i<size ; ++i){
    bar = x[i];
    novec[i] = v[i] ? bar : NAN;
    novec[i] -= y[i];
}
I would still like to know why gcc says control flow for the first loop.
clang auto-vectorizes even the first loop, but gcc8.2 doesn't. (https://godbolt.org/z/cnlwuO)
gcc vectorizes with -ffast-math. Perhaps it's worried about preserving FP exception flag status from the subtraction?
-fno-trapping-math is sufficient for gcc to auto-vectorize (without the rest of what -ffast-math sets), so apparently it's worried about FP exceptions. (https://godbolt.org/z/804ykV). I think it's being over-cautious, because the C source does compute bar every time, whether or not it's used.
gcc will auto-vectorize simple FP a[i] = b[i]+c[i] loops without any FP math options.
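As a concrete illustration of the flag combination described above (the file name is just a placeholder), something along these lines lets gcc 8.2 vectorize the first loop without pulling in all of -ffast-math:

gcc -O3 -mavx2 -fno-trapping-math -fopt-info-vec-all -c foo.c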

How to get SIMD code from C code

I am working on a machine with an Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz. It supports SSE4.2.
I have written C code to perform an XOR operation over string bits, but I want to write corresponding SIMD code and check for a performance improvement. Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#define LENGTH 10
unsigned char xor_val[LENGTH];
void oper_xor(unsigned char *r1, unsigned char *r2)
{
    unsigned int i;
    for (i = 0; i < LENGTH; ++i)
    {
        xor_val[i] = (unsigned char)(r1[i] ^ r2[i]);
        printf("%d",xor_val[i]);
    }
}
int main() {
    int i;
    time_t start, stop;
    double cur_time;
    start = clock();
    oper_xor("1110001111", "0000110011");
    stop = clock();
    cur_time = ((double) stop-start) / CLOCKS_PER_SEC;
    printf("Time used %f seconds.\n", cur_time / 100);
    for (i = 0; i < LENGTH; ++i)
        printf("%d",xor_val[i]);
    printf("\n");
    return 0;
}
On compiling and running this sample code I get the output shown below. The time is zero here, but in the actual project it consumes a significant amount of time.
gcc xor_scalar.c -o xor_scalar
pan88: ./xor_scalar
1110111100 Time used 0.000000 seconds.
1110111100
How can I start writing corresponding SIMD code for SSE4.2?
The Intel Compiler and any OpenMP compiler support #pragma simd and #pragma omp simd, respectively. These are your best bet to get the compiler to do SIMD codegen for you. If that fails, you can use intrinsics or, as a means of last resort, inline assembly.
Note that the printf function calls will almost certainly interfere with vectorization, so you should remove them from any loops in which you want to see SIMD.
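To give a feel for the intrinsics route, here is a minimal sketch (not from the original answer; the function name, the separate output pointer, and the length parameter are illustrative) that XORs 16 bytes per iteration with SSE2 instructions, which any SSE4.2 machine supports:

#include <emmintrin.h>  // SSE2 intrinsics: _mm_loadu_si128, _mm_xor_si128, _mm_storeu_si128
#include <stddef.h>

void oper_xor_sse(unsigned char *dst, const unsigned char *r1,
                  const unsigned char *r2, size_t len)
{
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i a = _mm_loadu_si128((const __m128i *)(r1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(r2 + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));
    }
    for (; i < len; ++i)                    // scalar tail for the last len % 16 bytes
        dst[i] = (unsigned char)(r1[i] ^ r2[i]);
}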
