I have
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <math.h>
int fib(int n) {
return n < 2 ? n : fib(n-1) + fib(n-2);
}
double clock_now()
{
struct timeval now;
gettimeofday(&now, NULL);
return (double)now.tv_sec + (double)now.tv_usec/1.0e6;
}
#define NITER 5
and in my main(), I'm doing a simple benchmark like this:
printf("hi\n");
double t = clock_now();
int f = 0;
double tmin = INFINITY;
for (int i=0; i<NITER; ++i) {
printf("run %i, %f\n", i, clock_now()-t);
t = clock_now();
f += fib(40);
t = clock_now()-t;
printf("%i %f\n", f, t);
if (t < tmin) tmin = t;
t = clock_now();
}
printf("fib,%.6f\n", tmin*1000);
When I compile with clang -O3 (LLVM 5.0 from Xcode 5.0.1), it always prints out zero time, except at the init of the for-loop, i.e. this:
hi
run 0, 0.866536
102334155 0.000000
run 1, 0.000001
204668310 0.000000
run 2, 0.000000
307002465 0.000000
run 3, 0.000000
409336620 0.000000
run 4, 0.000001
511670775 0.000000
fib,0.000000
It seems that it statically precalculates fib(40) and stores it somewhere. Right? The strange lag at the beginning (0.8 s) is probably because it loads that cache?
I'm doing this for benchmarking. The C compiler should optimize fib() itself as much as it can. However, I don't want it to precalculate the result at compile time. So basically I want all code optimized as heavily as possible, but not main() (or at least not this specific optimization). Can I do that somehow?
What optimization is this anyway, in this specific case? It's somewhat strange and quite nice.
I found a solution by marking certain data as volatile. Specifically, what I did was:
volatile int f = 0;
//...
volatile int FibArg = 40;
f += fib(FibArg);
That way, the compiler is forced to read FibArg when calling the function and cannot assume that the argument is constant. Thus it must actually call the function to compute the result.
The volatile int f was not necessary for my compiler at the moment, but it might become so in the future, once the compiler figures out that fib has no side effects and that neither its result nor f is ever used.
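Putting it together, here is a minimal self-contained sketch of the fixed benchmark (fib(), clock_now() and NITER are the same as in the question):
#include <stdio.h>
#include <sys/time.h>
#include <math.h>

static int fib(int n) { return n < 2 ? n : fib(n-1) + fib(n-2); }

static double clock_now(void)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)now.tv_sec + (double)now.tv_usec/1.0e6;
}

#define NITER 5

volatile int FibArg = 40; /* volatile: forces a runtime read of the argument */
volatile int f = 0;       /* volatile: forces the result to actually be stored */

int main(void)
{
    double tmin = INFINITY;
    for (int i = 0; i < NITER; ++i) {
        double t = clock_now();
        f += fib(FibArg); /* cannot be constant-folded: FibArg may change */
        t = clock_now() - t;
        printf("%i %f\n", f, t);
        if (t < tmin) tmin = t;
    }
    printf("fib,%.6f\n", tmin*1000);
    return 0;
}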
Note that this is still not the end. A future compiler could be advanced enough to guess that 40 is a likely argument for fib. Maybe it builds a database of likely values and, for the most likely ones, builds up a small cache. When fib is called, it does a fast runtime check for whether it has that value cached. The runtime check adds some overhead, of course, but the compiler might estimate that this overhead is minor for some particular code in relation to the speed gained by the cache.
I'm not sure whether a compiler will ever do such an optimization, but it could. Profile-Guided Optimization (PGO) already goes in that direction.
Related
I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
for (int i=0; i<bo; ++i) {
Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
}
}
int main()
{
int i, j;
int batch_size = 16384;
int output_dim = 21;
// double* Amem = new double[batch_size*output_dim];
// double* Z_curr = new double[batch_size*output_dim];
double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
memset(Amem, 0, sizeof(double)*batch_size*output_dim);
for (i=0; i<batch_size*output_dim; ++i) {
Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
}
relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it: if you have MKL, use the following; otherwise, plain g++ -O3 will do.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop, as I found using the option -fopt-info-vec-all for compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit in passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler with the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if, e.g., Amem == Z_curr+1 -- but Amem == Z_curr should, in this case, be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
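For reference, here is a sketch of that simplified version (same __restrict signature as above; std::max comes from <algorithm>):
#include <algorithm> // std::max

void relu(double* __restrict Amem, double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i) {
        // for z >= 0, max(z, 0.1*z) == z; for z < 0 it is 0.1*z,
        // so this matches the original ternary
        Amem[i] = std::max(Z_curr[i], 0.1 * Z_curr[i]);
    }
}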
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and having a look at the generated code. For portability, you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
int i;
for (i=0; i<=bo-8; i+=8)
{
__m512d z = _mm512_loadu_pd(Z_curr+i);
__mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
__m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1)); /* negative lanes get z*0.1, the rest keep z */
_mm512_storeu_pd(Amem+i, res);
}
// remaining elements in scalar loop
for (; i<bo; ++i) {
Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
}
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could do the remaining elements with a masked or overlapping store, as sketched below (or maybe you have use-cases where the size is guaranteed to be a multiple of 8 and you can leave the cleanup out).
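A hedged sketch of such a masked tail (AVX-512F intrinsics from <immintrin.h>; this would replace the scalar cleanup loop, with i being the first element the main loop did not process):
/* masked tail: process the 0..7 leftover elements in one go */
int rem = bo - i;
if (rem > 0) {
    __mmask8 tail = (__mmask8)((1u << rem) - 1); /* low 'rem' lanes active */
    __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);
    __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
    __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
    _mm512_mask_storeu_pd(Amem + i, tail, res);
}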
I have the following C program:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
int main() {
const int opt_count = 2;
int oc = 30;
int c = 900;
printf("%d %f\n", c, pow(oc, opt_count));
assert(c == (int)(pow(oc, opt_count)));
}
I'm running MinGW on Windows 8.1. Gcc version 4.9.3. I compile my program with:
gcc program.c -o program.exe
When I run it I get this output:
$ program
900 900.000000
Assertion failed: c == (int)(pow(oc, opt_count)), file program.c, line 16
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
What is going on? I expect the assertion to pass because 900 == 30^2.
Thanks!
Edit
I'm not using any fractions or decimals. I'm only using integers.
This happens when the implementation of pow is via
pow(x,y) = exp(log(x)*y)
With this formula the result can come out just below 900 (e.g. 899.99999999999989); printf's %f rounds that to 900.000000, but the cast to int truncates it to 899. Other library implementations first reduce the exponent by integer powers, thus avoiding this small floating-point error.
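To see this, a minimal check (the %.17g output is hypothetical; the exact digits are platform-dependent):
#include <stdio.h>
#include <math.h>

int main(void)
{
    double p = pow(30, 2);
    printf("%f\n", p);      /* %f rounds to 6 digits: prints 900.000000 */
    printf("%.17g\n", p);   /* full precision, e.g. 899.99999999999989 */
    printf("%d\n", (int)p); /* the cast truncates toward zero: 899 here */
    return 0;
}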
More involved implementations contain steps like
pow(x, y) {
    if (y < 0) return 1/pow(x, -y);
    n = (int)round(y);        // integer part of the exponent
    y = y - n;                // fractional remainder
    px = x; powxn = 1;
    while (n > 0) {           // halving-and-squaring: powxn = x^n
        if (n % 2 == 1) powxn *= px;
        n /= 2; px *= px;
    }
    return powxn * exp(log(x)*y);  // only the small fractional part uses exp/log
}
with the usual divide-and-conquer (halving-and-squaring) approach for the integer power powxn.
You have a nice answer (and solution) from @LutzL. Another solution is comparing the difference against an epsilon, e.g. 0.0001; this way you can use the standard function pow included in math.h:
#define EPSILON 0.0001
#define EQ(a, b) (fabs((a) - (b)) < EPSILON)
assert(EQ((double)c, pow(oc, opt_count)));
I'm learning the basics of SIMD so I was given a simple code snippet to see the principle at work with SSE and SSE2.
I recently installed MinGW to compile C code on Windows with gcc instead of using the Visual Studio compiler.
The objective of the example is to add two floats and then multiply by a third one.
The headers included are the following (which I guess are needed to use the SSE intrinsics):
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>
#include <sys/time.h> // for timing
Then I have a function to check what time it is, to compare time between calculations:
double now(){
struct timeval t; double f_t;
gettimeofday(&t, NULL);
f_t = t.tv_usec; f_t = f_t/1000000.0; f_t +=t.tv_sec;
return f_t;
}
The function to do the calculation in the "scalar" sense is the following:
void run_scalar(){
unsigned int i;
for( i = 0; i < N; i++ ){
rs[i] = (a[i]+b[i])*c[i];
}
}
Here is the code for the sse2 function:
void run_sse2(){
unsigned int i;
__m128 *mm_a = (__m128 *)a;
__m128 *mm_b = (__m128 *)b;
__m128 *mm_c = (__m128 *)c;
__m128 *mm_r = (__m128 *)rv;
for( i = 0; i <N/4; i++)
mm_r[i] = _mm_mul_ps(_mm_add_ps(mm_a[i],mm_b[i]),mm_c[i]);
}
The vectors are defined the following way (N is the size of the vectors and it is defined elsewhere) and a function init() is called to initialize them:
float a[N] __attribute__((aligned(16)));
float b[N] __attribute__((aligned(16)));
float c[N] __attribute__((aligned(16)));
float rs[N] __attribute__((aligned(16)));
float rv[N] __attribute__((aligned(16)));
void init(){
unsigned int i;
for( i = 0; i < N; i++ ){
a[i] = (float)rand () / RAND_MAX / N;
b[i] = (float)rand () / RAND_MAX / N;
c[i] = (float)rand () / RAND_MAX / N;
}
}
Finally here is the main that calls the functions and prints the results and computing time.
int main(){
double t;
init();
t = now();
run_scalar();
t = now()-t;
printf("S = %10.9f Temps du code scalaire : %f seconde(s)\n",1e5*sum(rs),t);
t = now();
run_sse2();
t = now()-t;
printf("S = %10.9f Temps du code vectoriel 2: %f seconde(s)\n",1e5*sum(rv),t);
}
If I compile this code with "gcc -o vec vectorial.c -msse -msse2 -msse3" or "mingw32-gcc -o vec vectorial.c -msse -msse2 -msse3", it compiles without any problems, but for some reason I can't run it on my Windows machine: in the command prompt I get an "access denied", and a big message appears on the screen saying "This app can't run on your PC, to find a version for your PC, check with the software publisher".
I don't really understand what is going on, nor do I have much experience with MinGW or C (just an introductory course to C++ done on Linux machines). I've tried playing around with different headers because I thought maybe I was targeting a different processor than the one in my PC, but I couldn't solve the issue. Most of the info I found was confusing.
Can someone help me understand what is going on? Is it a problem with the MinGW configuration, compiling for a Linux target? Is it something in the code that doesn't have an equivalent on Windows?
I'm trying to run it on a 64-bit Windows 8.1 PC.
Edit: Tried the configuration suggested on the site linked below. The output remains the same.
If I try to run it through MSYS I get a "Bad File number".
If I try to run it through the command prompt I get Access is Denied.
I'm guessing there's some sort of bug arising from permissions. Tried turning off the antivirus and User Account Control but still no luck.
Any ideas?
There is nothing wrong with your code. You did not provide the definition of sum() or N, but that is not a problem here. The switches -msse -msse2 appear not to be required.
I was able to compile and run your code on Linux (Ubuntu x86_64, compiled with gcc 4.8.2 and 4.6.3, on Atom D2700 and AMD Athlon LE-1640) and Windows7/64 (compiled with gcc 4.5.3 (32bit) and 4.8.2 (64bit), on Core i3-4330 and Core i7-4960X). It was running without problem.
Are you sure your CPU supports the required instructions? What exactly was the error code you got? Which MinGW configuration did you use? Out of curiosity, I used the one available at http://win-builds.org/download.html, which was very straightforward.
However, using the optimization flag -O3 created the best result -- with the scalar loop! Also useful are -m64 -mtune=native -s.
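For example, a command line along those lines (a sketch, mirroring the question's build command):
gcc -O3 -m64 -mtune=native -s -o vec vectorial.c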
Question
I am testing a simple program which calculates the Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks whether a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I read, the overhead is usually negligible, and the highest overhead I came across was about 6%. I measured around 30% overhead. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follows:
#iter C (fPIC) C C/C(fPIC)
1 0.01 0.01 1.00
100 0.04 0.03 0.75
200 0.06 0.04 0.67
500 0.15 0.1 0.67
1000 0.28 0.19 0.68
2000 0.56 0.37 0.66
4000 1.11 0.72 0.65
8000 2.21 1.47 0.67
16000 4.42 2.88 0.65
32000 8.8 5.77 0.66
64000 17.6 11.53 0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
int it;
for(it = 0; it < 11; ++it)
{
clock_t start = clock();
fractal(iterNumber[it]);
clock_t end = clock();
double millis = (end - start)*1000 / CLOCKS_PER_SEC/(double)1000;
printf("Iter: %d, time: %lf \n", iterNumber[it], millis);
}
return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
*res_re = a_re*b_re - a_im*b_im;
*res_im = a_re*b_im + a_im*b_re;
}
void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
double zPrev_re = P_re;
double zPrev_im = P_im;
double zNext_re = 0;
double zNext_im = 0;
double* p_zNext_re = &zNext_re;
double* p_zNext_im = &zNext_im;
int i;
for(i = 1; i <= iter; ++i)
{
sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
zNext_re = zNext_re + C_re;
zNext_im = zNext_im + C_im;
if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
{
return false;
}
zPrev_re = zNext_re;
zPrev_im = zNext_im;
}
return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
int noIterations = iter;
double xMin = -1.8;
double xMax = 1.6;
double yMin = -1.3;
double yMax = 0.8;
int xDim = 512;
int yDim = 384;
double P_re, P_im;
int nop;
int x, y;
for(x = 0; x < xDim; ++x)
for(y = 0; y < yDim; ++y)
{
P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
if(isMandelbrot(P_re, P_im, noIterations))
nop = x+y;
}
printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments). So a few words of explanation: at first I only compiled the program as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua, and saw big time differences, but couldn't understand why they grew with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C (without -fPIC). So I have checked it in a few configurations over the last few days, and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
No such constraint applies when you compile without -fPIC: the compiler can safely assume that a global symbol defined in the executable cannot be interposed, because the lookup scope begins with the executable itself, which is then followed by all other libraries, including the preloaded ones.
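A minimal sketch of the static variant for fractal.c (declarations only; the function bodies stay exactly as shown above):
/* fractal.c: internal helpers get static linkage, so the compiler may
   inline them even under -fPIC; fractal() keeps external linkage. */
static void multiplyComplex(double a_re, double a_im, double b_re, double b_im,
                            double* res_re, double* res_im);
static void sqComplex(double a_re, double a_im, double* res_re, double* res_im);
static bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter);
static bool isMandelbrot(double P_re, double P_im, int iter);
void fractal(int iter); /* the only symbol other files need to see */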
For a more thorough understanding have a look at the following paper.
As other people already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest (see the sketch after this list); this would allow GCC to optimize hidden functions more aggressively;
use private symbol aliases (__attribute__((alias("__f")))) to refer to library functions from within the library; this would again untie GCC's hands;
the previous suggestion can be automated with the -fno-semantic-interposition flag that was added in recent GCC versions.
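As a sketch of the visibility approach mentioned in the list above (hypothetical build, assuming the fractal code is compiled into a shared library):
/* build with: gcc -O3 -fPIC -fvisibility=hidden -shared fractal.c -o libfractal.so */
__attribute__((visibility("default")))
void fractal(int iter); /* explicitly exported entry point */

/* with -fvisibility=hidden, everything not marked "default" (e.g. the
   multiplyComplex/isInSet helpers) stays hidden and freely inlinable */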
It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
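For example (a hypothetical command line, mirroring the build commands from the question):
gcc -O3 -flto -fPIC fractalMain.c fractal.c -o ffpic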
In general, link time optimisations could lead to massive reductions in code size (~6%, according to a paper on link time optimisations in gold), and thus run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC, try -fpic instead (see the gcc docs). For x86 it might not make a difference, but you need to check this for yourself or ask on gcc-help.
A simple C program which uses gettimeofday() works fine when compiled without any flags (gcc 4.5.1) but doesn't give output when compiled with the flag -mno-sse.
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> /* for gettimeofday() */
int main()
{
struct timeval s,e;
float time;
int i;
gettimeofday(&s, NULL);
for( i=0; i< 10000; i++);
gettimeofday(&e, NULL);
time = e.tv_sec - s.tv_sec + e.tv_usec - s.tv_usec;
printf("%f\n", time);
return 0;
}
I have CFLAGS=-march=native -mtune=native
Could someone explain why this happens?
The program returns a correct value normally, but prints "0" when compiled with -mno-sse.
The flag -mno-sse causes floating point arguments to be passed on the stack, whereas the usual x86_64 ABI specifies that they should be passed via SSE registers.
Since printf() in your C library was compiled without -mno-sse, it is expecting floating point arguments to be passed in accordance with the ABI. This is why your code fails. It has nothing to do with gettimeofday().
If you wish to use printf() from your code compiled with -mno-sse and pass it floating point arguments, you will need to recompile your C library with that option and link against that version.
It appears that you are using a loop which does nothing in order to observe a time difference. The problem is, the compiler may optimize this loop away entirely. The issue may not be with -mno-sse itself; it may be that it allows an optimization that removes the loop, thus giving you the same time each time you run it.
I would recommend trying to put something in that loop which can't be optimized out (such as incrementing a number which you print out at the end). See if you still get the same behavior. If not, I'd recommend looking at the generated assembler (gcc -S) and seeing what the code difference is.
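For instance, a minimal sketch with a volatile counter that the optimizer cannot remove (names mirror the question's code; printing a long also sidesteps the floating-point argument problem described above):
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    struct timeval s, e;
    volatile int sink = 0; /* volatile store: the loop cannot be removed */
    int i;
    gettimeofday(&s, NULL);
    for (i = 0; i < 10000; i++)
        sink = i;          /* observable side effect on every iteration */
    gettimeofday(&e, NULL);
    printf("%ld\n", (long)(e.tv_sec - s.tv_sec) * 1000000L
                    + (e.tv_usec - s.tv_usec));
    return 0;
}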
The fields tv_usec and tv_sec are usually longs.
Redeclaring the variable time as a long integer solved the issue.
The following link addresses the issue.
http://gcc.gnu.org/ml/gcc-patches/2006-10/msg00525.html
Working code:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h> /* for gettimeofday() */
int main()
{
struct timeval s,e;
long time;
int i;
gettimeofday(&s, NULL);
for( i=0; i< 10000; i++);
gettimeofday(&e, NULL);
time = e.tv_sec - s.tv_sec + e.tv_usec - s.tv_usec;
printf("%ld\n", time);
return 0;
}
Thanks for the prompt replies. Hope this helps.
What do you mean doesn't give output?
0 (zero) is a perfectly reasonable output to expect.
Edit: Try compiling to assembler (gcc -S ...) and compare the differences between the normal and the no-sse versions, e.g.:
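For example (a sketch; the output file names are arbitrary):
gcc -S -O2 program.c -o with-sse.s
gcc -S -O2 -mno-sse program.c -o no-sse.s
diff with-sse.s no-sse.s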