Consider the following header file, "tls.h":
#include <stdint.h>

// calling this function is expensive
uint64_t foo(uint64_t x);

extern __thread uint64_t cache;

static inline uint64_t
get(uint64_t x)
{
    // if cache is not valid
    if (cache == UINT64_MAX)
        cache = foo(x);
    return cache + x;
}
and source file "tls.c":
#include "tls.h"
__thread uint64_t cache = {0};
uint64_t foo(uint64_t x)
{
    // imagine some calculations are performed here
    return 0;
}
Below is an example usage of the get() function in "main.c":
#include "tls.h"
uint64_t t = 0;
int main()
{
    uint64_t x = 0;
    for (uint64_t i = 0; i < 1024UL * 1024 * 1024; i++) {
        t += get(i);
        x++;
    }
}
The files are compiled as follows:
gcc -c -O3 tls.c
gcc -c -O3 main.c
gcc -O3 main.o tls.o
Examining the performance of the loop in "main.c" revealed that the compiler optimizes it very poorly. After disassembling the binary, it is clear that the TLS variable is accessed in every iteration.
Execution time on my machine is 1.7s.
However, if I remove the cache-validity check in the get() function so that it looks like this:
static inline uint64_t
get(uint64_t x)
{
    return cache + x;
}
the compiler is now able to create much faster code - it removes the loop completely and generates only one "add" instruction. Execution time is ~0.02s.
Why is the compiler not able to optimize the first case? A TLS variable cannot be changed by other threads, so the compiler should be able to optimize this, right?
Is there any other way I can optimize the get() function?
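Not an authoritative fix, but two things may be worth experimenting with: if foo() really computes its result purely from its argument, say so with a function attribute, and keep the TLS value in a local so each call touches thread-local storage at most twice. A sketch under those assumptions (the const attribute is only legal if foo truly has no side effects and reads no global state):

#include <stdint.h>

// assumption: foo() depends only on x and touches no global state
__attribute__((const)) uint64_t foo(uint64_t x);

extern __thread uint64_t cache;

static inline uint64_t
get(uint64_t x)
{
    uint64_t c = cache;        // read the TLS slot once
    if (c == UINT64_MAX) {     // cache not valid yet
        c = foo(x);
        cache = c;             // single write-back
    }
    return c + x;
}

Whether this actually lets GCC hoist the TLS access out of your loop is something to verify in the disassembly.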
Related
I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation, which is perhaps getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof(double), 64);
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof(double), 64);
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i = 0; i < batch_size*output_dim; ++i) {
        Z_curr[i] = -1 + 2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, use the following command if you have MKL; otherwise plain g++ -O3 works.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop, as I found using the option -fopt-info-vec-all during compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler by using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results if, e.g., Amem == Z_curr+1 -- but Amem == Z_curr should, in this case, be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it: https://godbolt.org/z/eTv4PnMWb
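For illustration, the simplified loop might look like this (a sketch; std::max comes from <algorithm>, and max(z, 0.1*z) equals z for z >= 0 and 0.1*z for z < 0):

#include <algorithm>

void relu(double* __restrict Amem, double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i) {
        // branch-free: the larger of z and 0.1*z implements the leaky ReLU
        Amem[i] = std::max(Z_curr[i], 0.1 * Z_curr[i]);
    }
}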
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions), and having a look at the generated code. For portability, you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
#include <immintrin.h>

void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i = 0; i <= bo-8; i += 8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        // mask of lanes where z is not positive (the "leaky" branch)
        __mmask8 nonpositive = _mm512_cmple_pd_mask(z, _mm512_setzero_pd());
        // where the mask is set, replace z with z*0.1; elsewhere keep z
        __m512d res = _mm512_mask_mul_pd(z, nonpositive, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in a scalar loop
    for (; i < bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations). Theoretically, you could do the remaining elements with a masked or overlapping store - or maybe you have use cases where the size is guaranteed to be a multiple of 8 and you can leave the cleanup out. A sketch of the masked-store variant follows.
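A possible sketch of that masked-store cleanup (an illustration under the same AVX-512F assumptions, not taken from the linked Godbolt):

// handle the 1..7 leftover elements with one masked load/store
if (i < bo) {
    __mmask8 tail = (__mmask8)((1u << (bo - i)) - 1);   // low bo-i lanes active
    __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);
    __mmask8 nonpositive = _mm512_cmple_pd_mask(z, _mm512_setzero_pd());
    __m512d res = _mm512_mask_mul_pd(z, nonpositive, z, _mm512_set1_pd(0.1));
    _mm512_mask_storeu_pd(Amem + i, tail, res);
}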
I tried compiling this code,
#include <stdlib.h>
struct rgb {
int r, g, b;
};
void adjust_brightness(struct rgb *picdata, size_t len, int adjustment) {
    // assume adjustment is between 0 and 255.
    for (int i = 0; i < len; i++) {
        picdata[i].r += adjustment;
        picdata[i].g += adjustment;
        picdata[i].b += adjustment;
    }
}
on OSX using this command,
$ cc -Rpass-analysis=loop-vectorize -c -std=c99 -O3 brightness.c
brightness.c:13:3: remark: loop not vectorized: unsafe dependent memory operations in loop [-Rpass-analysis=loop-vectorize]
for (int i = 0; i < len; i++) {
^
Can someone explain what is unsafe and dependent here? I'm learning about SIMD, and this was explained as the most obvious use case for SIMD. I was hoping to learn how the compiler would generate SIMD instructions for a simple example. In my head, I expected that instead of incrementing i by 1, the compiler would increment it by enough elements to fill the vector registers with the loop body's data.
Do I misunderstand?
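For what it's worth, here is one reformulation often suggested for this kind of pattern (a sketch, not an explanation of the remark itself): since struct rgb is three contiguous ints, the loop can be written over a flat int stream, which matches the increment-by-several-elements model described above:

#include <stdlib.h>

void adjust_brightness_flat(struct rgb *picdata, size_t len, int adjustment) {
    int *p = (int *)picdata;                // r, g, b are laid out contiguously
    for (size_t i = 0; i < 3 * len; i++) {
        p[i] += adjustment;                 // same adjustment for every channel
    }
}

Whether this convinces a given compiler to vectorize is something to check with -Rpass-analysis=loop-vectorize again.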
This one is about dereferencing structure variables in a chain. Please consider this code:
#include <stdio.h>

struct ChannelInfo
{
    int iData1;
    int iData2;
    int iData3;
    int iData4;
};
struct AppInfo
{
    struct ChannelInfo gChannelInfo[100];
} gAppInfo;
void foo1(void);
void foo2(void);

int main(void)
{
    gAppInfo.gChannelInfo[50].iData1 = 1;
    gAppInfo.gChannelInfo[50].iData2 = 2;
    gAppInfo.gChannelInfo[50].iData3 = 3;
    gAppInfo.gChannelInfo[50].iData4 = 4;
    foo1();
    foo2();
}
void foo1()
{
printf("Data1 = %d, Data2 = %d, Data3 = %d, Data4 = %d", gAppInfo.gChannelInfo[50].iData1, gAppInfo.gChannelInfo[50].iData2, gAppInfo.gChannelInfo[50].iData3, gAppInfo.gChannelInfo[50].iData4);
}
void foo2()
{
    struct ChannelInfo* pCurrentChan = &gAppInfo.gChannelInfo[50];
    printf("Data1 = %d, Data2 = %d, Data3 = %d, Data4 = %d", pCurrentChan->iData1, pCurrentChan->iData2, pCurrentChan->iData3, pCurrentChan->iData4);
}
Is foo2() any faster than foo1()? What happens if the array index is not a constant but is supplied by the user at run time? I would be grateful if someone could profile this code.
This assembly version of your code could help you understand why your code is slower. But of course it can vary depending on the target architecture and your optimization flags (compiling with -O2 or -O3 produces the same code for foo1 and foo2).
In foo2, the address of the ChannelInfo element is stored in a register, and the member addresses are calculated relative to the value in that register. In the worst case it is kept on the stack (it is a local variable), in which case foo2 could be as slow as foo1.
In foo1, the addresses passed to printf are calculated relative to the global variable gAppInfo stored in memory (or in cache).
As per #Ludin's request, I added these numbers for reference:
Execution of an instruction: 1 ns
Fetch from main memory: ~100 ns
Assembly version generated with -O2 (-Os and -O3 produce the same code):
Pondering things like this isn't meaningful - it is premature optimization, because the code will get optimized so that both functions are equivalent.
If you for some reason do not optimize the code, foo2() will be slightly slower, because it yields a few more instructions.
Please note that the call to printf is approximately 100 times slower than the rest of the code in those functions, so if you are truly concerned about performance, you should rather focus on avoiding stdio.h than on these kinds of micro-optimizations.
At the bottom of this answer I have included some benchmarking code for Windows. Because the printf call is so slow compared to the rest of the code, and we aren't really interested in benchmarking printf itself, I removed the printf calls and replaced them with volatile variables, meaning that the compiler is required to perform the reads regardless of the optimization level.
gcc test.c -otest.exe -std=c11 -pedantic-errors -Wall -Wextra -O0
Output:
foo1 5.669101ms
foo2 7.178366ms
gcc test.c -otest.exe -std=c11 -pedantic-errors -Wall -Wextra -O2
Output:
foo1 2.509606ms
foo2 2.506889ms
As we can see, the difference in execution time of the non-optimized code corresponds roughly to the ratio of assembler instructions produced (see the answer by #dvhh).
Unscientifically:
10 / (10 + 16) instructions = 0.384
5.67 / (5.67 + 7.18) milliseconds = 0.441
Benchmarking code:
#include <stdlib.h>
#include <stdio.h>
#include <windows.h>
struct ChannelInfo
{
    int iData1;
    int iData2;
    int iData3;
    int iData4;
};
struct AppInfo
{
    struct ChannelInfo gChannelInfo[100];
} gAppInfo;
void foo1 (void);
void foo2 (void);
static double get_time_diff_ms (const LARGE_INTEGER* freq,
                                const LARGE_INTEGER* before,
                                const LARGE_INTEGER* after)
{
    /* counts / frequency = seconds; scaled by 1000 to give milliseconds */
    return ((after->QuadPart - before->QuadPart)*1000.0) / (double)freq->QuadPart;
}
int main (void)
{
    /*** Initialize benchmarking functions ***/
    LARGE_INTEGER freq;
    if (QueryPerformanceFrequency(&freq) == FALSE)
    {
        printf("QueryPerformanceFrequency not supported");
        return 0;
    }
    LARGE_INTEGER time_before;
    LARGE_INTEGER time_after;

    gAppInfo.gChannelInfo[50].iData1 = 1;
    gAppInfo.gChannelInfo[50].iData2 = 2;
    gAppInfo.gChannelInfo[50].iData3 = 3;
    gAppInfo.gChannelInfo[50].iData4 = 4;

    const size_t ITERATIONS = 1000000;

    QueryPerformanceCounter(&time_before);
    for (size_t i = 0; i < ITERATIONS; i++)
    {
        foo1();
    }
    QueryPerformanceCounter(&time_after);
    printf("foo1 %fms\n", get_time_diff_ms(&freq, &time_before, &time_after));

    QueryPerformanceCounter(&time_before);
    for (size_t i = 0; i < ITERATIONS; i++)
    {
        foo2();
    }
    QueryPerformanceCounter(&time_after);
    printf("foo2 %fms\n", get_time_diff_ms(&freq, &time_before, &time_after));
}
void foo1 (void)
{
    volatile int d1, d2, d3, d4;
    d1 = gAppInfo.gChannelInfo[50].iData1;
    d2 = gAppInfo.gChannelInfo[50].iData2;
    d3 = gAppInfo.gChannelInfo[50].iData3;
    d4 = gAppInfo.gChannelInfo[50].iData4;
}
void foo2 (void)
{
    struct ChannelInfo* pCurrentChan = &gAppInfo.gChannelInfo[50];
    volatile int d1, d2, d3, d4;
    d1 = pCurrentChan->iData1;
    d2 = pCurrentChan->iData2;
    d3 = pCurrentChan->iData3;
    d4 = pCurrentChan->iData4;
}
Yes, foo2() is definitely faster than foo1(), because foo2 keeps a pointer to that memory block, and every time you access it, it just dereferences the pointer and fetches the value from memory.
I'm trying to learn how to write gcc inline assembly.
The following code is supposed to perform an shl instruction and return the result.
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
    int left = x;
    __asm__ ("shl %1, %0"
             : "=r"(left)
             : "i"(b), "0"(left));
    return left;
}
int main()
{
    uint64_t a = 1000000000;
    uint64_t res = rotate(a, 10);
    printf("%llu\n", res);
    return 0;
}
Compilation fails with error: impossible constraint in asm
The problem is basically with "i"(b). I've tried "o", "n", "m" among others, but it still doesn't work. It's either this error or an operand size mismatch.
What am I doing wrong?
As written, your code compiles correctly for me (I have optimization enabled). However, I believe you may find this to be a bit better:
#include <stdio.h>
#include <inttypes.h>
uint64_t rotate(uint64_t x, int b)
{
__asm__ ("shl %b[shift], %[value]"
: [value] "+r"(x)
: [shift] "Jc"(b)
: "cc");
return x;
}
int main(int argc, char *argv[])
{
    uint64_t a = 1000000000;
    uint64_t res = rotate(a, 10);
    printf("%llu\n", res);
    return 0;
}
Note that 'J' is for 64-bit. If you are compiling 32-bit code, 'I' is the correct constraint.
Other things of note:
You are truncating your shift value from uint64_t to int? Are you compiling for 32-bit? I don't believe shl can do 64-bit shifts when compiled as 32-bit code.
Allowing 'c' in the input constraint means you can use variable shift amounts (i.e., not hard-coded at compile time).
Since shl modifies the flags, use "cc" to let the compiler know.
Using the [name] form makes the asm easier to read (IMO).
The %b is a modifier. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#i386Operandmodifiers
If you want to really get smart about inline asm, check out the latest gcc docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
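Incidentally, since the function is named rotate but the instruction used is shl (a shift), here is a sketch of what an actual 64-bit rotate-left could look like, reusing the same constraint pattern as above (assuming x86-64):

#include <stdio.h>
#include <inttypes.h>

uint64_t rotate_left(uint64_t x, int b)
{
    __asm__ ("rolq %b[shift], %[value]"   /* rotate left, 64-bit */
             : [value] "+r"(x)
             : [shift] "Jc"(b)            /* 0-63 immediate or count in CL */
             : "cc");                     /* rol modifies the flags */
    return x;
}

int main(void)
{
    printf("%" PRIu64 "\n", rotate_left(1000000000u, 10));
    return 0;
}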
Question
I am testing a simple program which calculates the Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks whether a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in run times after adding the -fPIC flag. From what I have read, the overhead is usually negligible, and the highest overhead I came across was about 6%. Yet I measured around 30% overhead. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results (times in seconds) look as follows:
#iter    C (fPIC)    C        C/C(fPIC)
1        0.01        0.01     1.00
100      0.04        0.03     0.75
200      0.06        0.04     0.67
500      0.15        0.1      0.67
1000     0.28        0.19     0.68
2000     0.56        0.37     0.66
4000     1.11        0.72     0.65
8000     2.21        1.47     0.67
16000    4.42        2.88     0.65
32000    8.8         5.77     0.66
64000    17.6        11.53    0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
int main()
{
    int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
    int it;
    for (it = 0; it < 11; ++it)
    {
        clock_t start = clock();
        fractal(iterNumber[it]);
        clock_t end = clock();
        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        printf("Iter: %d, time: %lf \n", iterNumber[it], seconds);
    }
    return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"
void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}

void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
    multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}
bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
    double zPrev_re = P_re;
    double zPrev_im = P_im;
    double zNext_re = 0;
    double zNext_im = 0;
    double* p_zNext_re = &zNext_re;
    double* p_zNext_im = &zNext_im;
    int i;
    for (i = 1; i <= iter; ++i)
    {
        sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
        zNext_re = zNext_re + C_re;
        zNext_im = zNext_im + C_im;
        if (zNext_re*zNext_re + zNext_im*zNext_im > 4)
        {
            return false;
        }
        zPrev_re = zNext_re;
        zPrev_im = zNext_im;
    }
    return true;
}
bool isMandelbrot(double P_re, double P_im, int iter)
{
    return isInSet(0, 0, P_re, P_im, iter);
}
void fractal(int iter)
{
    int noIterations = iter;
    double xMin = -1.8;
    double xMax = 1.6;
    double yMin = -1.3;
    double yMax = 0.8;
    int xDim = 512;
    int yDim = 384;
    double P_re, P_im;
    int nop = 0;  /* initialized in case no point lands in the set */
    int x, y;
    for (x = 0; x < xDim; ++x)
        for (y = 0; y < yDim; ++y)
        {
            P_re = (double)x*(xMax-xMin)/(double)xDim + xMin;
            P_im = (double)y*(yMax-yMin)/(double)yDim + yMin;
            if (isMandelbrot(P_re, P_im, noIterations))
                nop = x+y;
        }
    printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments), so a few words of explanation. First I only compiled the program as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua - and saw big time differences, but couldn't understand why they were growing with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C without -fPIC. So I have checked it in a few configurations over the last few days, and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.
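A minimal sketch of that change in fractal.c (declarations only; the bodies stay as they are). Note that isMandelbrot itself must stay non-static if Lua is to call it through the shared object:

// internal linkage: these helpers are no longer interposable symbols,
// so the compiler is free to inline them even with -fPIC
static void multiplyComplex(double a_re, double a_im, double b_re, double b_im,
                            double* res_re, double* res_im);
static void sqComplex(double a_re, double a_im, double* res_re, double* res_im);
static bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter);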
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position-independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
If you compile without -fPIC, on the other hand, the compiler can safely assume that a global symbol defined in the executable cannot be interposed, because the lookup scope begins with the executable itself, which is then followed by all other libraries, including the preloaded ones.
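To make interposition concrete, here is a sketch (assuming isMandelbrot is exported from a shared object, as in the Lua setup described above; file and script names are illustrative):

/* interpose.c -- build: gcc -shared -fPIC interpose.c -o interpose.so
   run:   LD_PRELOAD=./interpose.so lua script.lua
   Calls to isMandelbrot that go through the PLT now land here. */
#include <stdbool.h>

bool isMandelbrot(double P_re, double P_im, int iter)
{
    (void)P_re; (void)P_im; (void)iter;
    return false;   /* arbitrary replacement behavior, for demonstration */
}

It is precisely because the compiler cannot rule this out that it must keep PLT calls and skip inlining when generating position-independent code.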
For a more thorough understanding have a look at the following paper.
As other people have already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest; this allows GCC to optimize hidden functions more aggressively;
use private symbol aliases (__attribute__((alias("__f")))) to refer to library functions from within the library; this again unties GCC's hands (see the sketch after this list);
the previous suggestion can be automated with the -fno-semantic-interposition flag added in recent GCC versions.
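A sketch of the visibility and alias points combined, under the assumption that the library exports a function f (all names here are illustrative):

// the real work lives in a hidden symbol: internal callers bind to it
// directly (no PLT, no interposition risk), so GCC can inline it freely
__attribute__((visibility("hidden"))) double f_impl(double x)
{
    return x * x;   /* placeholder body */
}

// the exported name is an alias; external callers may still interpose it,
// but that no longer constrains calls made inside the library
extern __typeof(f_impl) f __attribute__((alias("f_impl"),
                                         visibility("default")));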
It's interesting to note that Clang differs from GCC here: it allows all such optimizations by default regardless of -fPIC (this can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run times you are seeing for this particular case, since the link-time optimizations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
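For example, a sketch of the corresponding build (the question's flags with LTO added, so the optimizer sees both translation units at link time):

gcc -O3 -flto -fPIC fractalMain.c fractal.c -o ffpic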
In general, link-time optimizations can lead to significant reductions in code size (~6%, see the paper on link-time optimizations in gold), and thus in run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC, try -fpic instead - see the gcc docs. For x86 it might not make a difference, but you should check this for yourself or ask on gcc-help.