vectorize sum of squared residual with gcc/clang without intrinsics

I'm trying to convince gcc (4.8.1) or clang (3.4) to vectorize the following
code on an Ivy Bridge processor:
#include <stdlib.h>
#include <math.h>
float sumsqr(float *v, float mean, size_t n) {
    float ret = 0;
    for(size_t i = 0; i < n; i++) {
        ret += pow((v[i] - mean), 2);
    }
    return ret;
}
and compiling it as follows, without success:
$ gcc -std=c99 -O3 -march=native -mtune=native -ffast-math -S foo.c
Is there a way to modify the code (without using intrinsics) or the gcc invocation in order to obtain vectorized code?

The pow function is very general, and it may not be visible to the compiler what it does (remember that it can compute things like pow(1.8, -3.19)). So it might help to use only built-in operations and avoid function calls:
for(size_t i = 0; i < n; i++)
{
    float const x = v[i] - mean;
    ret += x * x;
}

First, don't use pow if you don't have to; plain multiplication lets gcc vectorize. As for why you are getting this behavior: notice that if you replace pow with powf, gcc manages to vectorize. gcc knows that pow(x,2) is x*x, but the issue here is that pow is a function for double. So the compiler must convert v[i]-mean to double, compute the square as a double, add it to ret as a double, and only then convert back to float. If at least ret were a double, the compiler could vectorize, but as written, all those conversions make the loop too complicated and not worth vectorizing.
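The accumulator point above can be sketched as follows. This is a minimal sketch, not the asker's code: the function name sumsqr_dacc is hypothetical. The running sum is kept in double, so each element needs a single float-to-double conversion and the result is narrowed only once at the end. Whether a particular gcc or clang version actually vectorizes this loop still has to be checked in the generated assembly (gcc can report its decisions with -fopt-info-vec).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of sumsqr: the accumulator is a double, so the loop
 * body performs one float->double conversion per element and no narrowing
 * conversion until after the loop. */
float sumsqr_dacc(const float *v, float mean, size_t n) {
    assert(v != NULL || n == 0);
    double ret = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = (double)(v[i] - mean); /* residual, widened once */
        ret += d * d;                           /* plain multiply, no pow() */
    }
    return (float)ret;                          /* single narrowing at the end */
}
```

For the input {1, 2, 3, 4} with mean 2.5 the residuals are -1.5, -0.5, 0.5 and 1.5, so the function returns 5.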

Related

Why is clang unable to unroll a loop (that gcc unrolls)?

I am writing in C and compiling using clang. I am trying to unroll a loop. The loop is not unrolled and there is a warning.
loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
You can find the results here: https://godbolt.org/z/4flN-k
#include <stddef.h>
int foo(int c)
{
    size_t w = 0;
    size_t i = sizeof(size_t);
#pragma unroll
    while(i--)
    {
        w = (w << 8) | c;
    }
    return w;
}
GCC can unroll the loop with -O3 and thus I assume that clang should also unroll it.

I do not know why, but clang can unroll it if you use the same options:
https://godbolt.org/z/VYn0CA
The only difference is the size of the integer.

Is fmod faster than % for integer modulus calculation

Just found the following line in some old src code:
int e = (int)fmod(matrix[i], n);
where matrix is an array of int, and n is a size_t
I'm wondering why the use of fmod rather than % where we have integer arguments, i.e. why not:
int e = (matrix[i]) % n;
Could there possibly be a performance reason for choosing fmod over % or is it just a strange bit of code?

Could there possibly be a performance reason for choosing fmod over %
or is it just a strange bit of code?
fmod might be a bit faster on architectures with a high-latency IDIV instruction that takes (say) ~50 cycles or more, so that the cost of fmod's function call and of the int <-> double conversions can be amortized.
According to Agner Fog's instruction tables, IDIV on the AMD K10 architecture takes 24-55 cycles. On a modern Intel Haswell, its latency is listed as 22-29 cycles; however, if there are no dependency chains, the reciprocal throughput is much better on Intel, at 8-11 clock cycles.

fmod might be a tiny bit faster than the integer division on selected architectures.
Note however that if n has a known nonzero value at compile time, matrix[i] % n would be compiled as a multiplication with a small adjustment, which should be much faster than both the integer modulus and the floating-point modulus.
Another interesting difference is the behavior for n == 0 and for INT_MIN % -1. In both cases the integer modulus operation has undefined behavior, which results in abnormal program termination on many current architectures. The floating-point modulus does not trap in these cases: fmod(x, 0) returns NaN, and fmod(INT_MIN, -1) returns a zero. Converting a NaN back to int is not defined by the C standard either, but it does not usually cause abnormal program termination. This might be the reason the original programmer chose this surprising solution.
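The two corner cases above can also be handled in integer code, without going through fmod. A minimal sketch, assuming a hypothetical helper checked_mod (not part of the original code) that reports failure for a zero divisor and special-cases a divisor of -1, for which the remainder is always 0:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: computes a % n without ever executing an undefined
 * integer operation. Returns false (and leaves *out untouched) when n == 0. */
bool checked_mod(int a, int n, int *out) {
    assert(out != NULL);
    if (n == 0)
        return false;   /* a % 0 is undefined behavior */
    if (n == -1) {
        *out = 0;       /* avoids INT_MIN % -1; the remainder is 0 for any a */
        return true;
    }
    *out = a % n;       /* safe: n is neither 0 nor -1 */
    return true;
}
```

The extra branches cost little compared with the division itself, and they make the failure mode explicit instead of depending on what the hardware does.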

Experimentally (and quite counter-intuitively), fmod is faster than %, at least on an AMD Phenom(tm) II X4 955 with 6400 bogomips. Here are two programs that use either of the techniques, both compiled with the same compiler (GCC) and the same options (cc -O3 foo.c -lm), and run on the same hardware:
#include <math.h>
#include <stdio.h>
int main()
{
int volatile a=10,b=12;
int i, sum = 0;
for (i = 0; i < 1000000000; i++)
sum += a % b;
printf("%d\n", sum);
return 0;
}
Running time: 9.07 sec.
#include <math.h>
#include <stdio.h>
int main()
{
int volatile a=10,b=12;
int i, sum = 0;
for (i = 0; i < 1000000000; i++)
sum += (int)fmod(a, b);
printf("%d\n", sum);
return 0;
}
Running time: 8.04 sec.

uint16_t subtraction GCC compilation error

I have the following program
#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
int main(void) {
uint16_t o = 100;
uint32_t i1 = 30;
uint32_t i2 = 20;
o = (uint16_t) (o - (i1 - i2)); /*Case A*/
o -= (uint16_t) (i1 - i2); /*Case B*/
(void)o;
return 0;
}
Case A compiles with no errors.
Case B causes the following error
[error: conversion to ‘uint16_t’ from ‘int’ may alter its value [-Werror=conversion]]
The warning options I'm using are:
-Werror -Werror=strict-prototypes -pedantic-errors -Wconversion -pedantic -Wall -Wextra -Wno-unused-function
I'm using GCC 4.9.2 on Ubuntu 15.04 64-bits.
Why do I get this error in Case B but not in Case A?
PS:
I ran the same example with clang compiler and both cases are compiled fine.

Integer Promotion is a strange thing. Basically, all integer values, of any smaller size, are promoted to int so they can be operated on efficiently, and then converted back to the smaller size when stored. This is mandated by the C standard.
So, Case A really looks like this:
o = (uint16_t) ((int)o - ((uint32_t)i1 - (uint32_t)i2));
(Note that uint32_t does not fit in int, so needs no promotion.)
And, Case B really looks like this:
o = (int)o - (int)(uint16_t) ((uint32_t)i1 - (uint32_t)i2);
The main difference is that Case A has an explicit cast, whereas Case B has an implicit conversion.
From the GCC manual:
-Wconversion
Warn for implicit conversions that may alter a value. ....
So, only Case B gets a warning.

Your case B is equivalent to:
o = o - (uint16_t) (i1 - i2); /*Case B*/
The result is an int which may not fit in uint16_t, so, per your extreme warning options, it produces a warning (and thus an error since you're treating warnings as errors).
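To illustrate the difference between the explicit and the implicit conversion, here is a minimal sketch of Case B rewritten so that the narrowing back to uint16_t is spelled out. The function names sub_u16 and sub_u16_check are hypothetical, and the expectation that this form passes -Wconversion follows from the explanation above (every narrowing step is an explicit cast) rather than from a test with GCC 4.9.2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: Case B with the final narrowing made explicit.
 * The inner cast truncates (i1 - i2) to 16 bits; the subtraction is then
 * done in int after integer promotion; the outer cast narrows the int
 * result back to uint16_t. */
uint16_t sub_u16(uint16_t o, uint32_t i1, uint32_t i2) {
    return (uint16_t)(o - (uint16_t)(i1 - i2));
}

/* Usage check: 100 - (30 - 20) is 90, and 0 - 1 wraps to 65535. */
void sub_u16_check(void) {
    assert(sub_u16(100, 30, 20) == 90);
    assert(sub_u16(0, 1, 0) == 65535);
}
```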

How to overcome "existence of vector dependence" in icc

I want to vectorize following loop in C:
for(k = 0; k < SysData->numOfClaGen; k++)
A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
where there is no aliasing between the variables and ind0 is a constant. None of the other pointers (A or B) points to ind0, and therefore ind0 remains constant throughout the loop.
When I compile the code with icc, it says that this loop cannot be vectorized due to possible vector dependence. Here is the message:
loop was not vectorized: existence of vector dependence.
I narrowed the problem down and found out that replacing ind0 with a constant number solves the problem. So, I assume that icc thinks A may point to ind0 and therefore, ind0 may change.
I would like to know how I can tell the compiler that it is safe to vectorize such a loop.
Thanks in advance for your help.

Add #pragma ivdep in front of the for loop; it instructs the compiler to ignore assumed vector dependencies.
#pragma ivdep
for(k = 0; k < SysData->numOfClaGen; k++)
A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
For more information about ivdep, see the icc documentation.

Use of the restrict modifier for pointers asserts to the compiler that there is no aliasing. This keyword was introduced in C99. C++ does not support it, but many C++ compilers support __restrict as an equivalent proprietary extension. With the Intel compiler, one has to enable use of restrict by adding the command line flag -restrict (Linux) or /Qrestrict (Windows). In the following version of your code the loop is vectorized as desired when using Intel compiler version 13.1.3.198:
#include <math.h>
struct bar {
int numOfClaGen;
};
void foo (double * restrict A,
const double * restrict B,
const double * restrict x1,
const struct bar * restrict SysData,
const int ind0)
{
int k;
for (k = 0; k < SysData->numOfClaGen; k++) {
A[k] = B[k] * cos(x1[2 * k] - x1[ind0 + k]);
}
}
Invoking the compiler as follows (on a 64-bit Windows system)
icl /c /Ox /QxHost /Qrestrict /Qvec-report2 vectorize.c
the compiler reported
vectorize.c(14): (col. 5) remark: LOOP WAS VECTORIZED.

icc was changed a year ago to set -ansi-alias as a default for Linux and Mac. For Windows, this default can't be counted on, as it conflicts with Microsoft usage. This option is equivalent to gcc -fstrict-aliasing, which has been a default since gcc 3.0. I think it's much better to set this option than to use ivdep, restrict, or simd for such a limited issue.
Although it's not well documented, icc treats __restrict the same as gcc and doesn't require the restrict or C99 option to accept it. In principle, it should come into play only for the objects being modified (A[] in the example above).
Strangely, __restrict has a slightly different meaning for MSVC++. It permits non-vector optimizations which might otherwise be prevented by possible dependencies, but doesn't enable vectorization (but it might apply to the present case).

How to store the contents of a __m128d simd vector as doubles without accessing it as a union?

The code I want to optimize is basically a simple but large arithmetic formula. It should be fairly simple to analyze the code automatically and compute the independent multiplications/additions in parallel, but I read that autovectorization only works for loops.
I've read multiple times now that accessing single elements of a vector via a union or some other way should be avoided at all costs, and should instead be replaced by a _mm_shuffle_pd (I'm working on doubles only)...
I can't figure out how to store the content of a __m128d vector as doubles without accessing it as a union. Also, does an operation like this give any performance gain compared to scalar code?
union {
__m128d v;
double d[2];
} vec;
union {
__m128d v;
double d[2];
} vec2;
vec.v = index1;
vec2.v = index2;
temp1 = _mm_mul_pd(temp1, _mm_set_pd(bvec[(int)vec.d[1]], bvec[(int)vec2.d[1]]));
Also, the two unions look ridiculously ugly, but when using
union dvec {
__m128d v;
double d[2];
} vec;
and trying to declare the indexX variables as dvec, the compiler complained that dvec is undeclared.

Unfortunately if you look at MSDN it says the following:
You should not access the __m128d fields directly. You can, however, see these types in the debugger. A variable of type __m128 maps to the XMM[0-7] registers.
I'm no expert in SIMD, however this tells me that what you're doing won't work as it's just not designed to.
EDIT:
I've just found this, and it says:
Use __m128, __m128d, and __m128i only on the left-hand side of an assignment, as a return value, or as a parameter. Do not use it in other arithmetic expressions such as "+" and ">>".
It also says:
Use __m128, __m128d, and __m128i objects in aggregates, such as unions (for example, to access the float elements) and structures.
So maybe you can use them, but only in unions. Seems contradictory to what MSDN says, however.
EDIT2:
Here is another interesting resource that describes, with examples, how to use these SIMD types.
In the above link, you'll find this code:
#include <math.h>
#include <emmintrin.h>
double in1_min(__m128d x)
{
return x[0];
}
In the above we use a new extension in gcc 4.6 to access the high and low parts via indexing. Older versions of gcc require using a union and writing to an array of two doubles. This is cumbersome, and extra slow when optimization is turned off.

_mm_cvtsd_f64 + _mm_unpackhi_pd
For doubles:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
__m128d x = _mm_set_pd(1.5, 2.5);
/* _mm_cvtsd_f64 + _mm_unpackhi_pd */
assert(_mm_cvtsd_f64(x) == 2.5);
assert(_mm_cvtsd_f64(_mm_unpackhi_pd(x, x)) == 1.5);
}
For floats, I have posted the following examples at How to convert a hex float to a float in C/C++ using _mm_extract_ps SSE GCC instrinc function
_mm_cvtss_f32 + _mm_shuffle_ps
_MM_EXTRACT_FLOAT
For ints you can use _mm_extract_epi32:
#include <assert.h>
#include <x86intrin.h>
int main(void) {
    /* _mm_set_epi32 takes its arguments from the highest element to the lowest. */
    __m128i x = _mm_set_epi32(1, 2, 3, 4);
    assert(_mm_extract_epi32(x, 3) == 1);
    assert(_mm_extract_epi32(x, 2) == 2);
    assert(_mm_extract_epi32(x, 1) == 3);
    assert(_mm_extract_epi32(x, 0) == 4);
}
GitHub upstream.
Compile and run examples with:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
./main.out
Tested on Ubuntu 19.04 amd64.

There is a function double _mm_cvtsd_f64 (__m128d a), defined in "emmintrin.h", that accesses the lower double of an SSE vector of two doubles.
From the Intel Intrinsics guide:
Synopsis
double _mm_cvtsd_f64 (__m128d a)
#include "emmintrin.h"
Instruction: movsd
CPUID Feature Flag: SSE2
Description:
Copy the lower double-precision (64-bit) floating-point element of a to dst.
Operation
dst[63:0] := a[63:0]
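Putting the pieces from the answers above together, here is a minimal sketch of getting both lanes of a __m128d out as doubles without a union. The helper names store_lanes, low_lane and high_lane are hypothetical. _mm_storeu_pd, _mm_cvtsd_f64 and _mm_unpackhi_pd are SSE2 intrinsics from emmintrin.h, so the sketch assumes an x86 target with SSE2.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Write both lanes to memory: out[0] receives the low lane, out[1] the
 * high lane. The unaligned store places no alignment requirement on out. */
void store_lanes(__m128d v, double out[2]) {
    assert(out != NULL);
    _mm_storeu_pd(out, v);
}

/* Extract the low lane directly. */
double low_lane(__m128d v) {
    return _mm_cvtsd_f64(v);
}

/* Move the high lane into the low position, then extract it. */
double high_lane(__m128d v) {
    return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v));
}
```

With v = _mm_set_pd(1.5, 2.5), the low lane is 2.5 and the high lane is 1.5, matching the assertions in the example above. Whether this is faster than scalar code depends on how much work stays in vector registers between the loads and the extraction.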
