The gcc flag -funsafe-math-optimizations (part of -ffast-math) turns on FTZ and DAZ (flush-to-zero and denormals-are-zero). However, adding any optimization level appears to disable this behavior.
#include <stdio.h>
#include <pmmintrin.h>
int main(int argc, char** argv)
{
//_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
float normal_f = 1.18e-38f;
double normal_d = 2.23e-308;
normal_f *= 0.1f;
if (normal_f != 0.0f)
printf("FTZ/DAZ disabled for floats (%e)\n", (double) normal_f);
normal_d *= 0.1;
if (normal_d != 0.0)
printf("FTZ/DAZ disabled for doubles (%e)\n", normal_d);
return 0;
}
When compiled with gcc foo.c -ffast-math, both FTZ and DAZ are enabled (i.e., no output to stdout). However, if any optimization level is added (e.g., -O1, -O3, -Ofast), FTZ and DAZ appear to be disabled:
$ ./a.out
FTZ/DAZ disabled for floats (1.180000e-39)
FTZ/DAZ disabled for doubles (2.230000e-309)
Even stranger, I see the same behavior when I explicitly enable FTZ and DAZ with _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON): optimization disables it. This call only changes anything when compiling without -ffast-math.
My question: how do I achieve FTZ/DAZ when using optimization? Also, are there other -ffast-math behaviors that are disabled with optimization?
I have observed this in both gcc 10.2 and 6.3.
It looks like this behavior is an artifact of my unit test. The compiler itself doesn't obey FTZ/DAZ; only the generated code does. In a situation like the one shown here, the compiler performs the calculations at compile time (where subnormals are kept) and concludes it is safe to bypass the conditional statements and go straight to the printf calls.
Breaking this into two compilation units makes the problem disappear:
bar.c
#include <stdio.h>
void check_normalf(float f)
{
if (f != 0.0f) {
printf("FTZ/DAZ disabled for floats (%e)\n", (double) f);
}
}
void check_normal(double d)
{
if (d != 0.0) {
printf("FTZ/DAZ disabled for doubles (%e)\n", d);
}
}
foo.c
#include <pmmintrin.h>
void check_normalf(float f);
void check_normal(double d);
int main(int argc, char** argv)
{
//_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
float normal_f = 1.18e-38f;
double normal_d = 2.23e-308;
check_normalf(normal_f * 0.1f);
check_normal(normal_d * 0.1);
return 0;
}
The problem persists when the check functions are defined in the same file, but that is again due to the compiler's optimizations: it can still see through the calls.
I have an old implementation that uses the _mm256_exp_ps() function, and I could compile it with GCC, ICC, and Clang. Now I cannot compile the code anymore because the compiler does not find the function _mm256_exp_ps().
Here is the simplified version of my problem:
#include <stdio.h>
#include <x86intrin.h>
int main()
{
__m256 vec1, vec2;
vec2 = _mm256_exp_ps(vec1);
return 0;
}
And the error is:
$ gcc -march=native temp.c -o temp
temp.c: In function ‘main’:
temp.c:9:16: warning: implicit declaration of function ‘_mm256_exp_ps’; did you mean ‘_mm256_rcp_ps’? [-Wimplicit-function-declaration]
9 | vec2 = _mm256_exp_ps(vec1);
| ^~~~~~~~~~~~~
| _mm256_rcp_ps
temp.c:9:16: error: incompatible types when assigning to type ‘__m256’ from type ‘int’
This means the compiler cannot find the intrinsic.
If I use another function, for example _mm256_add_ps(), there are no errors, so the header is accessible; the problem is specific to _mm256_exp_ps(), which might have been changed when AVX-512 support was added to the compiler.
#include <stdio.h>
#include <x86intrin.h>
int main()
{
__m256 vec1, vec2;
vec2 = _mm256_add_ps(vec1, vec2);
return 0;
}
Could you please help me solve the problem?
As a workaround, which should hopefully allow you to compile and run your program, you could provide a function with the same name yourself. If it is not in a performance-critical part of your program, this might be an acceptable fix. Below are SSE and AVX versions of the function.
#include <stdio.h>
#include <math.h>
#include <immintrin.h>
#include <xmmintrin.h>
// gcc Junk.c -o Junk.bin -mavx -lm
// gcc Junk.c -o Junk.bin -msse4 -lm
__m128 _mm128_exp_ps(__m128 invec) {
float *element = (float *)&invec;
return _mm_setr_ps(
expf(element[0]),
expf(element[1]),
expf(element[2]),
expf(element[3])
);
}
/*
__m256 _mm256_exp_ps(__m256 invec) {
float *element = (float *)&invec;
return _mm256_setr_ps(
expf(element[0]),
expf(element[1]),
expf(element[2]),
expf(element[3]),
expf(element[4]),
expf(element[5]),
expf(element[6]),
expf(element[7])
);
}
*/
int main()
{
__m128 vec1, vec2;
vec1 = _mm_setr_ps( 1.0, 1.1, 1.2, 1.3);
vec2 = _mm128_exp_ps(vec1);
float *element = (float *)&vec2;
int i;
for (i=0; i<4; i++) {
printf("%f %f\n", element[i], expf(1.0f + i/10.0f));
}
return 0;
}
EDIT:
After comments by Peter Cordes about possible undefined behaviour when casting a float pointer to a __m128 or __m256 variable, I thought I'd add a suggestion for maximum safety and portability, taken from the links he provided. I don't know for sure that the code above has a problem due to alignment issues, but it appears that the more correct way is to replace the line
float *element = (float *)&invec;
with
float element[4];
_mm_storeu_ps(element, invec);
and
float element[8];
_mm256_storeu_ps(element, invec);
for the SSE and AVX functions respectively.
I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
for (int i=0; i<bo; ++i) {
Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
}
}
int main()
{
int i, j;
int batch_size = 16384;
int output_dim = 21;
// double* Amem = new double[batch_size*output_dim];
// double* Z_curr = new double[batch_size*output_dim];
double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
memset(Amem, 0, sizeof(double)*batch_size*output_dim);
for (i=0; i<batch_size*output_dim; ++i) {
Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
}
relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, use the following if you have MKL; otherwise use plain g++ -O3.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop as I found using option -fopt-info-vec-all for compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler by using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if e.g. Amem == Z_curr+1 -- but Amem == Z_curr should, in this case, be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
With that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc easily vectorizes it: https://godbolt.org/z/eTv4PnMWb
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and have a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
int i;
for (i=0; i<=bo-8; i+=8)
{
__m512d z = _mm512_loadu_pd(Z_curr+i);
__mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd()); // mask of elements with z < 0
__m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1)); // scale only the negative elements
_mm512_storeu_pd(Amem+i, res);
}
// remaining elements in scalar loop
for (; i<bo; ++i) {
Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
}
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could handle the remaining elements with a masked or overlapping store instead; or maybe you have use cases where the size is guaranteed to be a multiple of 8 and you can leave the cleanup loop out entirely.
I am trying to understand pure functions, and have been reading through the Wikipedia article on that topic. I wrote the minimal sample program as follows:
#include <stdio.h>
static int a = 1;
static __attribute__((pure)) int pure_function(int x, int y)
{
return x + y;
}
static __attribute__((pure)) int impure_function(int x, int y)
{
a++;
return x + y;
}
int main(void)
{
printf("pure_function(0, 0) = %d\n", pure_function(0, 0));
printf("impure_function(0, 0) = %d\n", impure_function(0, 0));
return 0;
}
I compiled this program with gcc -O2 -Wall -Wextra, expecting that an error, or at least a warning, would be issued for decorating impure_function() with __attribute__((pure)). However, I received no warnings or errors, and the program ran without issues.
Isn't marking impure_function() with __attribute__((pure)) incorrect? If so, why does it compile without any errors or warnings, even with the -Wextra and -Wall flags?
Thanks in advance!
Doing this is incorrect; you are responsible for using the attribute correctly.
Look at this example:
static __attribute__((pure)) int impure_function(int x, int y)
{
extern int a;
a++;
return x + y;
}
void caller()
{
impure_function(1, 1);
}
Code generated by GCC (with -O1) for the function caller is:
caller():
ret
As you can see, the call to impure_function was removed completely because the compiler treats it as "pure".
GCC can also mark a function as "pure" internally, automatically, if it can see its definition:
static __attribute__((noinline)) int pure_function(int x, int y)
{
return x + y;
}
void caller()
{
pure_function(1, 1);
}
Generated code:
caller():
ret
So there is little point in using this attribute on functions whose definitions are visible to the compiler. It is meant for functions whose definition is not available, for example a function defined in another shared library (DLL). That means that where the attribute is used properly, the compiler cannot perform a sanity check anyway, so implementing a warning is of limited use (although not meaningless).
I don't think there is anything stopping the GCC developers from implementing such a warning, except the time that must be spent on it.
The pure attribute is a hint for the optimizing compiler. gcc probably doesn't care about pure functions when you pass it just -O0 (the default). So if f is pure (and defined outside of your translation unit, e.g. in some outside library), GCC might optimize y = f(x) + f(x); into something like
{
int tmp = f(x); /// tmp is a fresh variable, not appearing elsewhere
y = tmp + tmp;
}
but if f is not pure (which is the usual case: think of f calling printf or malloc), such an optimization is forbidden.
Standard math functions like sin or sqrt are pure (except for IEEE rounding mode craziness, see http://floating-point-gui.de/ and Fluctuat for more), and they are complex enough to compute to make such optimizations worthwhile.
You might compile your code with gcc -O2 -Wall -fdump-tree-all to guess what is happening inside the compiler. You could add the -fverbose-asm -S flags to get a generated *.s assembler file.
You could also read the Bismon draft report (notably its section §1.4). It might give some intuitions related to your question.
In your particular case, I am guessing that gcc inlines your calls, and then purity matters less.
If you have time to spend, you might consider writing your own GCC plugin to make such a warning. You'll spend months in writing it! These old slides might still be useful to you, even if the details are obsolete.
At the theoretical level, be aware of Rice's theorem. A consequence of it is that perfectly detecting pure functions is impossible in general.
Be aware of the GCC Resource Center, located in Bombay.
So I have the following code:
#include <math.h>
int main (void) {
float max = fmax (1.0,2.0);
return 0;
}
Which compiles and runs fine, but if instead of passing 1.0 and 2.0 to the function I pass a, b with those values:
#include <math.h>
int main (void) {
float a = 1.0; float b = 2.0;
float max = fmax (a,b);
return 0;
}
I get the following error:
undefined reference to `fmax'
What is the difference? What am I doing wrong?
I'm using this command to compile:
c99 fmax_test.c
In the first case fmax probably gets constant-folded away at compile time. In the second case it does not, and you then get a link error. Without knowing what compiler you are using it's hard to give a specific remedy, but if it's gcc then you may need to add -lm to link the math library, e.g.
c99 -Wall fmax_test.c -lm
Note also that fmax is for doubles - you should be using fmaxf for floats.
compile with -lm
I'm using gcc; this may not apply to your compiler.
try this:
c99 fmax_test.c -lm
Question
I am testing a simple code which calculates Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks if a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I have read, the overhead is usually negligible, and the highest overhead I came across was about 6%; I measured around 30%. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follow
#iter C (fPIC) C C/C(fPIC)
1 0.01 0.01 1.00
100 0.04 0.03 0.75
200 0.06 0.04 0.67
500 0.15 0.1 0.67
1000 0.28 0.19 0.68
2000 0.56 0.37 0.66
4000 1.11 0.72 0.65
8000 2.21 1.47 0.67
16000 4.42 2.88 0.65
32000 8.8 5.77 0.66
64000 17.6 11.53 0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
int it;
for(it = 0; it < 11; ++it)
{
clock_t start = clock();
fractal(iterNumber[it]);
clock_t end = clock();
double millis = (end - start)*1000 / CLOCKS_PER_SEC/(double)1000;
printf("Iter: %d, time: %lf \n", iterNumber[it], millis);
}
return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
*res_re = a_re*b_re - a_im*b_im;
*res_im = a_re*b_im + a_im*b_re;
}
void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
double zPrev_re = P_re;
double zPrev_im = P_im;
double zNext_re = 0;
double zNext_im = 0;
double* p_zNext_re = &zNext_re;
double* p_zNext_im = &zNext_im;
int i;
for(i = 1; i <= iter; ++i)
{
sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
zNext_re = zNext_re + C_re;
zNext_im = zNext_im + C_im;
if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
{
return false;
}
zPrev_re = zNext_re;
zPrev_im = zNext_im;
}
return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
int noIterations = iter;
double xMin = -1.8;
double xMax = 1.6;
double yMin = -1.3;
double yMax = 0.8;
int xDim = 512;
int yDim = 384;
double P_re, P_im;
int nop;
int x, y;
for(x = 0; x < xDim; ++x)
for(y = 0; y < yDim; ++y)
{
P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
if(isMandelbrot(P_re, P_im, noIterations))
nop = x+y;
}
printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments), so a few words of explanation. First I only compiled the program as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. To call it from Lua I created a shared object -- and saw big time differences, but couldn't understand why they grew with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C without -fPIC. I have checked this in a few configurations over the last few days and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static you will likely get the same performance when compiling with -fPIC because the compiler will be free to perform inlining.
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
The very same assumption can be made if you compile without -fPIC, i.e. the compiler can safely assume that a global symbol defined in the executable cannot be interposed because the lookup scope begins with the executable itself which is then followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
As other people already pointed out -fPIC forces GCC to disable many optimizations e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling main executable (not libraries) as this allows compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only necessary functions from the library and hide the rest; this would allow GCC to optimize hidden functions more aggressively;
use private symbol aliases (__attribute__((alias("__f")))) to refer to library functions from within the library; this again unties GCC's hands
the previous suggestion can be automated with the -fno-semantic-interposition flag added in recent GCC versions
It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
As others have discussed in the comments on your opening post, compiling with -flto should help reduce the difference in run times you are seeing in this particular case, since GCC's link-time optimisations will likely figure out that it's actually OK to inline a couple of functions ;)
In general, link-time optimisation can lead to sizeable reductions in code size (~6%; see the paper on link-time optimisation in gold), and thus in run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android; this question on SO briefly discusses it as well. Also, just so you know, -fpic is the faster version of -fPIC, so if you must use position-independent code, try -fpic instead (see the gcc docs). For x86 it might not make a difference, but you should check this for yourself or ask on gcc-help.