I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operator which may be getting in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i<batch_size*output_dim; ++i) {
        Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, use the following if you have MKL; otherwise plain g++ -O3 will do.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but the loop is still not vectorized, as I found by compiling with -fopt-info-vec-all:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit in passing a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output. Of course, this will likely give wrong results if, e.g., Amem == Z_curr+1, but Amem == Z_curr should (in this case) be fine.
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc easily vectorizes it: https://godbolt.org/z/eTv4PnMWb
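Concretely, the simplified kernel would look something like the sketch below; for all finite inputs the branchless max gives the same result as the ternary (only NaN handling can differ).

#include <algorithm>

void relu(double* __restrict Amem, double* Z_curr, int bo)
{
    for (int i = 0; i < bo; ++i) {
        // max(z, 0.1*z) equals (z > 0 ? z : 0.1*z) for every finite z
        Amem[i] = std::max(Z_curr[i], 0.1 * Z_curr[i]);
    }
}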
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and having a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
#include <immintrin.h>

void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        // elements where z < 0 get multiplied by 0.1, the rest are passed through
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        __m512d res = _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
Godbolt: https://godbolt.org/z/s6br5KEEc (if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't run more than 7 iterations). Alternatively, you could handle the remaining elements with a masked or overlapping store, or, if your use-cases guarantee that the size is a multiple of 8, leave the cleanup loop away entirely.
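For illustration, a masked-store variant of the cleanup could look roughly like the following sketch (assuming AVX-512F; treat it as untested):

#include <immintrin.h>

// Sketch only: same leaky-ReLU kernel, but the tail is handled with a
// masked load/store instead of a scalar loop.
void relu_avx512_masked_tail(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    for (i = 0; i <= bo - 8; i += 8) {
        __m512d z = _mm512_loadu_pd(Z_curr + i);
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        _mm512_storeu_pd(Amem + i, _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1)));
    }
    int rem = bo - i;                          // 0..7 leftover elements
    if (rem > 0) {
        __mmask8 tail = (__mmask8)((1u << rem) - 1);
        __m512d z = _mm512_maskz_loadu_pd(tail, Z_curr + i);
        __mmask8 negative = _mm512_cmplt_pd_mask(z, _mm512_setzero_pd());
        _mm512_mask_storeu_pd(Amem + i, tail, _mm512_mask_mul_pd(z, negative, z, _mm512_set1_pd(0.1)));
    }
}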
Related
Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sum. sum_0 only computes sum of the 1st column, while sum_1 does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I can write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.
/* "test.c" */
#include <stdlib.h>
// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
    if (n == 0) return;
    size_t i;
    double *a = A, *b = A + n;
    double c0 = 0.0, c1 = 0.0;
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
    for (i = 0; i < n; i++) {
        c0 += a[i];
        if (j > 0) c1 += b[i];
    }
    c[0] = c0;
    if (j > 0) c[1] = c1;
}

#define macro_define_sum(FUN, j)            \
void FUN (size_t n, double *A, double *c) { \
    sum_template(j, n, A, c);               \
}

macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
If I compile it with
gcc -O2 -mavx test.c
GCC (say the latest 8.2), after inlining, constant propagation and dead code elimination, would optimize out code involving c1 for function sum_0 (Check it on Godbolt).
I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying-and-pasting a big proportion of the code and manually defining different function versions.
However, such convenience is lost if I activate OpenMP 4.0+ with
gcc -O2 -mavx -fopenmp test.c
sum_template is no longer inlined and no dead code elimination is applied (Check it on Godbolt). But if I remove the flag -mavx to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on an x86-64 (Sandybridge).
Remark
Using GCC's auto-vectorization -ftree-vectorize -ffast-math would not have this issue (Check it on Godbolt). But I wish to use OpenMP because it allows portable alignment pragma across different compilers.
Background
I write modules for an R package, which needs to be portable across platforms and compilers. Writing an R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform, and configures a set of default compilation flags. R does not have an auto-vectorization flag, but it has an OpenMP flag. This means that using OpenMP SIMD is the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.
The simplest way to solve this problem is with __attribute__((always_inline)), or other compiler-specific overrides.
#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline inline
#else
#define ALWAYS_INLINE inline // cross your fingers
#endif
ALWAYS_INLINE
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
...
}
Godbolt proof that it works.
Also, don't forget to use -mtune=haswell, not just -mavx. It's usually a good idea. (However, promising aligned data will stop gcc's default -mavx256-split-unaligned-load tuning from splitting 256-bit loads into 128-bit vmovupd + vinsertf128, so code gen for this function is fine with tune=haswell. But normally you want this for gcc to auto-vectorize any other functions.)
You don't really need static along with inline; if a compiler decides not to inline it, it can at least share the same definition across compilation units.
Normally gcc decides to inline or not according to function-size heuristics. But even setting -finline-limit=90000 doesn't get gcc to inline with your #pragma omp (How do I force gcc to inline a function?). I had been guessing that gcc didn't realize that constant-propagation after inlining would simplify the conditional, but 90000 "pseudo-instructions" seems plenty big. There could be other heuristics.
Possibly OpenMP sets some per-function stuff differently in ways that could break the optimizer if it let them inline into other functions. Using __attribute__((target("avx"))) stops that function from inlining into functions compiled without AVX (so you can do runtime dispatching safely, without inlining "infecting" other functions with AVX instructions across if(avx) conditions.)
One thing OpenMP does that you don't get with regular auto-vectorization is that reductions can be vectorized without enabling -ffast-math.
Unfortunately OpenMP still doesn't bother to unroll with multiple accumulators or anything to hide FP latency. #pragma omp is a pretty good hint that a loop is actually hot and worth spending code-size on, so gcc should really do that, even without -fprofile-use.
So especially if this ever runs on data that's hot in L2 or L1 cache (or maybe L3), you should do something to get better throughput.
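To illustrate what unrolling with multiple accumulators means here, below is a hand-written sketch of the single-column sum (this is not what OpenMP emits; it just shows the shape of code that hides FP-add latency):

#include <stddef.h>

// Sketch: four independent accumulators let the adds overlap instead of
// serializing on one dependency chain (the result is reassociated).
double sum_unrolled(size_t n, const double *a)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; ++i)      // leftover elements
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}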
And BTW, alignment isn't usually a huge deal for AVX on Haswell. But 64-byte alignment does matter a lot more in practice for AVX512 on SKX. Like maybe 20% slowdown for misaligned data, instead of a couple %.
(But promising alignment at compile time is a separate issue from actually having your data aligned at runtime. Both are helpful, but promising alignment at compile time makes tighter code with gcc7 and earlier, or on any compiler without AVX.)
I desperately needed to resolve this issue, because in my real C project, if no template trick were used for auto generation of different function versions (simply called "versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.
I was able to find a way out, and am now posting a solution using the toy example in the question.
I planned to utilize an inline function sum_template for versioning. If successful, the versioning happens at compile time when the compiler performs optimization. However, the OpenMP pragma turns out to defeat this compile-time versioning. The option is then to do versioning at the pre-processing stage using macros only.
To get rid of the inline function sum_template, I manually inline it in the macro macro_define_sum:
#include <stdlib.h>

// j can be 0 or 1
#define macro_define_sum(FUN, j)                              \
void FUN (size_t n, double *A, double *c) {                   \
    if (n == 0) return;                                       \
    size_t i;                                                 \
    double *a = A, *b = A + n;                                \
    double c0 = 0.0, c1 = 0.0;                                \
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) \
    for (i = 0; i < n; i++) {                                 \
        c0 += a[i];                                           \
        if (j > 0) c1 += b[i];                                \
    }                                                         \
    c[0] = c0;                                                \
    if (j > 0) c[1] = c1;                                     \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
In this macro-only version, j is directly substituted by 0 or 1 during macro expansion. Whereas in the inline function + macro approach in the question, I only have sum_template(0, n, A, c) or sum_template(1, n, A, c) at the pre-processing stage, and j in the body of sum_template is only propagated later, at compile time.
Unfortunately, the above macro gives an error: I can not define or test a macro inside another one (see 1, 2, 3). The OpenMP pragma, starting with #, is causing the problem here. So I have to split this template into two parts: the part before the pragma and the part after.
#include <stdlib.h>

#define macro_before_pragma     \
    if (n == 0) return;         \
    size_t i;                   \
    double *a = A, *b = A + n;  \
    double c0 = 0.0, c1 = 0.0;

#define macro_after_pragma(j)   \
    for (i = 0; i < n; i++) {   \
        c0 += a[i];             \
        if (j > 0) c1 += b[i];  \
    }                           \
    c[0] = c0;                  \
    if (j > 0) c[1] = c1;

void sum_0 (size_t n, double *A, double *c) {
    macro_before_pragma
    #pragma omp simd reduction (+: c0) aligned (a: 32)
    macro_after_pragma(0)
}

void sum_1 (size_t n, double *A, double *c) {
    macro_before_pragma
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
    macro_after_pragma(1)
}
I no longer need macro_define_sum. I can define sum_0 and sum_1 straightaway using the two macros defined above. I can also adjust the pragma appropriately. Here, instead of having a template function, I have templates for code blocks of a function and can reuse them with ease.
The compiler output is as expected in this case (Check it on Godbolt).
Update
Thanks for the various feedback; it is all very constructive (this is why I love Stack Overflow).
Thanks to Marc Glisse for pointing me to Using an openmp pragma inside #define. Yeah, it was my bad not to have searched for this issue. #pragma is a directive, not a real macro, so there has to be some way to put it inside a macro. Here is the neat version using the _Pragma operator:
/* "neat.c" */
#include <stdlib.h>
// stringizing: https://gcc.gnu.org/onlinedocs/cpp/Stringizing.html
#define str(s) #s
// j can be 0 or 1
#define macro_define_sum(j, alignment) \
void sum_ ## j (size_t n, double *A, double *c) { \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0; \
_Pragma(str(omp simd reduction (+: c0, c1) aligned (a, b: alignment))) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1; \
}
macro_define_sum(0, 32)
macro_define_sum(1, 32)
Other changes include:
I used token concatenation to generate the function name;
alignment is made a macro argument. For AVX, a value of 32 means good alignment, while a value of 8 (sizeof(double)) essentially implies no alignment. Stringizing is required to turn those tokens into the string that _Pragma requires.
Use gcc -E neat.c to inspect the pre-processing result. Compilation gives the desired assembly output (Check it on Godbolt).
A few comments on Peter Cordes' informative answer
Using the compiler's function attributes. I am not a professional C programmer. My experience with C comes merely from writing R extensions. This development environment means that I am not very familiar with compiler attributes. I know some, but don't really use them.
-mavx256-split-unaligned-load is not an issue in my application, because I will allocate aligned memory and apply padding to ensure alignment. I just need to promise the compiler the alignment so that it can generate aligned load / store instructions. I do need to do some vectorization on unaligned data, but that contributes a very limited part of the whole computation. Even if I get a performance penalty on split unaligned loads, it won't be noticed in reality. I also don't compile every C file with auto-vectorization. I only do SIMD when the operation is hot in L1 cache (i.e., it is CPU-bound, not memory-bound). By the way, -mavx256-split-unaligned-load is for GCC; what is it for other compilers?
I am aware of the difference between static inline and inline. If an inline function is only accessed by one file, I will declare it as static so that the compiler does not generate a standalone copy of it.
OpenMP SIMD can do the reduction efficiently even without GCC's -ffast-math. However, it does not use horizontal addition to aggregate results inside the accumulator register at the end of the reduction; it runs a scalar loop to add up each element (see code blocks .L5 and .L27 in the Godbolt output).
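For comparison, a horizontal sum of a 256-bit accumulator can be done with a couple of shuffles; this is just an illustrative sketch of the alternative, not what GCC emits here:

#include <immintrin.h>

// Sketch: reduce a __m256d accumulator to a single double with shuffles
// instead of looping over the spilled register one element at a time.
static inline double hsum256_pd(__m256d v)
{
    __m128d lo  = _mm256_castpd256_pd128(v);     // lower two doubles
    __m128d hi  = _mm256_extractf128_pd(v, 1);   // upper two doubles
    __m128d sum = _mm_add_pd(lo, hi);            // pairwise add
    __m128d swp = _mm_unpackhi_pd(sum, sum);     // move high half down
    return _mm_cvtsd_f64(_mm_add_sd(sum, swp));  // final scalar add
}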
Throughput is a good point (especially for floating-point arithmetic, which has relatively high latency but high throughput). My real C code where SIMD is applied is a triple loop nest. I unroll the outer two loops to enlarge the code block in the innermost loop to enhance throughput. Vectorization of the innermost one is then sufficient. With the toy example in this Q & A, where I just sum an array, I can use -funroll-loops to ask GCC for loop unrolling, using several accumulators to enhance throughput.
On this Q & A
I think most people would treat this Q & A in a more technical way than I do. They might be interested in using compiler attributes or tweaking compiler flags / parameters to force function inlining. Therefore, Peter's answer as well as Marc's comment under the answer are still very valuable. Thanks again.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void sort();

int main() {
    int i;
    for (i = 0; i < 100000; i++) {
        sort();
    }
}

void sort() {
    int i, j, k, array[100], l = 99, m;
    for (i = 0; i < 100; i++) {
        array[i] = rand() % 1000 + 1;
    }
    for (k = 0; k < 99; k++) {
        for (j = 0; j < l; j++) {
            if (array[j + 1] > array[j]) {
                int temp = array[j];
                array[j] = array[j + 1];
                array[j + 1] = temp;
            }
        }
        l--;
    }
    for (m = 0; m < 100; m++) {
        printf("%d ", array[m]);
    }
}
On the Linux shell, gcc sort.c -o sort and then time ./sort >> out.
Here, if I do gcc -O2 sort.c -o sort, and similarly -O3 and -O4, then the time keeps on decreasing. How do the optimization options work? Please explain in terms of real time, user time and system time.
PS: The code might be a little inefficient. Kindly ignore that.
Optimization options work between the reading of the source code and the writing of the binary instructions to the CPU.
GCC is a multi-phase compiler, where the phases roughly consist of:
Creating "tokens" from the input text.
Arranging those tokens into an abstract syntax tree structure.
Pruning the abstract syntax tree.
Creating register based instructions, assuming an infinite number of CPU registers.
Mapping the registers into the actual registers available.
Writing the binary information out, in the loader's expected format.
Optimizations can impact a number of these phases; typically they become active in steps 3 through 5 above. There are many optimizations, including:
Constant folding – Evaluate constant subexpressions in advance.
Strength reduction – Replace slow operations with faster equivalents.
Null sequences – Delete useless operations.
Combine operations – Replace several operations with one equivalent.
Algebraic laws – Use algebraic laws to simplify or reorder instructions.
Special case instructions – Use instructions designed for special operand cases.
Address mode operations – Use address modes to simplify code.
Loop unrolling – Replace a loop with equivalent straight-line instructions.
Partial loop unrolling – Reduce the number of times a loop body is evaluated while preserving the overall behavior.
Note that these are not all the optimizations that might be performed, but it starts to give you an idea.
For example, if the compiler sees
int s = 3;
while (s < 6) {
    printf("%d\n", s);
    s++;
}
and the flags are set to unroll loops, then it might write CPU instructions equivalent to
printf("%d\n", 3);
printf("%d\n", 4);
printf("%d\n", 5);
Those instructions might seem more wordy to us humans, but the CPU commands might be smaller, because there is no need to "look up" the now-erased value of s, nor to add one to it or store the updated value back into RAM.
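Similarly, here is a small sketch of what constant folding and strength reduction might do (illustrative only; the exact transformations depend on the compiler and flags):

unsigned seconds_per_day(void) {
    return 24 * 60 * 60;   // constant folding: computed as 86400 at compile time
}

unsigned scale(unsigned x) {
    return x * 8;          // strength reduction: typically becomes a shift, x << 3
}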
GCC arranges the optimizations into categories, ranging from "safe" to "risky". -O2 is a good compromise between speed and safety. Higher -O numbers are riskier.
The -O compiler flag controls the amount of optimization that you wish the compiler to perform. In short, building the project will take longer but the resulting executable should be faster. For more information, type man gcc at the command prompt, or run gcc -c -Q -O3 --help=optimizers for specific information regarding the optimizations performed at a particular level.
-O stands for optimize, in which gcc will automatically take the steps necessary to optimize your program. You can read more about the specific steps that GCC takes to optimize your program here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
But essentially, -O2 is more optimized than -O1, and -O3 more than -O2. This might come with drawbacks regarding compiled binary size, where the resulting binary could use more space but run faster, and vice versa. You can actually paste your code into https://godbolt.org/, pick a compiler from the dropdown, and add -O1 or any other optimization option in the flags box beside it; godbolt will show you what the resulting code looks like. You will be able to see a difference between -O1 and -O2, namely that the -O2 generated code is probably shorter and uses a lot of shortcuts to implement your algorithm.
gcc offers a number of optimization flags. You can see what each one does specifically here:
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
There's always a tradeoff with optimizations, whether increased compile time, increased use of memory, etc.
There are dozens of optimizations enabled by the -O2 flag, so it might not be immediately clear which specific ones affect the sorting. Instead of -O2, you can try each optimization individually, for example the -falign-loops flag, to see whether that is the one providing the performance increase.
I am doing a benchmark about vectorization on macOS with the following i7 processor:
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
My MacBook Pro is from mid-2014.
I tried different flag options for vectorization: the three that interest me are SSE, AVX and AVX2.
For my benchmark, I add each element of 2 arrays and store the sum in a third array.
Note that I am working with the double type for these arrays.
Here are the functions used in my benchmark code:
1) First, with SSE vectorization:
#ifdef SSE
#include <x86intrin.h>
#define ALIGN 16
void addition_tab(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop
    for (i=size-1; i>=0; i-=2)
    {
        // Intrinsic SSE syntax
        const __m128d x = _mm_load_pd(a);      // Load two x elements
        const __m128d y = _mm_load_pd(b);      // Load two y elements
        const __m128d sum = _mm_add_pd(x, y);  // Compute two sum elements
        _mm_store_pd(c, sum);                  // Store two sum elements
        // Increment pointers by 2 since SSE vectorizes on 128 bits = 16 bytes = 2*sizeof(double)
        a += 2;
        b += 2;
        c += 2;
    }
}
#endif
2) Second, with AVX256 vectorization:
#ifdef AVX256
#include <immintrin.h>
#define ALIGN 32
void addition_tab(int size, double *a, double *b, double *c)
{
    int i;
    // Main loop
    for (i=size-1; i>=0; i-=4)
    {
        // Intrinsic AVX syntax
        const __m256d x = _mm256_load_pd(a);      // Load four x elements
        const __m256d y = _mm256_load_pd(b);      // Load four y elements
        const __m256d sum = _mm256_add_pd(x, y);  // Compute four sum elements
        _mm256_store_pd(c, sum);                  // Store four sum elements
        // Increment pointers by 4 since AVX256 vectorizes on 256 bits = 32 bytes = 4*sizeof(double)
        a += 4;
        b += 4;
        c += 4;
    }
}
#endif
For SSE vectorization, I expect a speedup of around 2 because I process data 128 bits = 16 bytes = 2*sizeof(double) at a time.
The results I get for SSE vectorization are shown in the following figure:
So, I think these results are valid because the speedup is around a factor of 2.
Now for AVX256, I get the following figure:
For AVX256 vectorization, I expect a speedup of around 4 because I process data 256 bits = 32 bytes = 4*sizeof(double) at a time.
But as you can see, I still get a factor of 2, not 4, for the speedup.
I don't understand why I get the same speedup for SSE and AVX vectorization.
Does it come from the compilation flags, from my model of processor, ...? I don't know.
Here are the compilation command lines I used for all the results above:
For SSE :
gcc-mp-4.9 -DSSE -O3 -msse main_benchmark.c -o vectorizedExe
For AVX256 :
gcc-mp-4.9 -DAVX256 -O3 -Wa,-q -mavx main_benchmark.c -o vectorizedExe
Moreover, with my model of processor, could I use AVX512 vectorization (once the issue in this question is solved)?
Thanks for your help
UPDATE 1
I tried the different options from @Mischa but still can't get a factor-of-4 speedup with the AVX flags and options. You can take a look at my C source at http://example.com/test_vectorization/main_benchmark.c.txt (with a .txt extension for direct viewing in the browser), and the shell script for benchmarking is http://example.com/test_vectorization/run_benchmark .
As @Mischa said, I tried to apply the following command line for compilation:
$GCC -O3 -Wa,-q -mavx -fprefetch-loop-arrays main_benchmark.c -o vectorizedExe
but the generated code has no AVX instructions.
If you could take a look at these files, that would be great. Thanks.
You are hitting the wall for cache-to-RAM transfer. Your Core i7 has a 64-byte cache line. For SSE2, a 16-byte store requires a 64-byte load, update, and queue back to RAM. 16-byte loads in ascending order benefit from automatic prefetch prediction, so you get some benefit on the load side. Add _mm_prefetch of destination memory, say 256 bytes ahead of the next store. The same applies to AVX2 32-byte stores.
NP. There are options:
(1) x86-specific code:
#include <immintrin.h>
...
for (int i=size; ...) {
    _mm_prefetch(256+(char*)c, _MM_HINT_T0);
    ...
    _mm256_store_pd(c, sum);
(2) gcc-specific code:
for (int i=size; ...) {
    __builtin_prefetch(c+32);
    ...
(3) gcc -fprefetch-loop-arrays: the compiler knows best.
(3) is the best if your version of gcc supports it.
(2) is next-best, if you compile and run on same hardware.
(1) is portable to other compilers.
"256", unfortunately, is a guestimate, and hardware-dependent. 128 is a minimum, 512 a maximum, depending on your CPU:RAM speed. If you switch to _mm512*(), double those numbers.
If you are working across a range of processors, may I suggest compiling in a way that covers all cases, then testing cpuid(ax=0) >= 7 and cpuid(ax=7,cx=0):bx & 0x04000020 at runtime (bit 5 = 0x20 for AVX2, bit 26 = 0x04000000 for AVX512PF prefetch).
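With gcc or clang, a less error-prone way to do this check is __builtin_cpu_supports; a sketch (the addition_tab_* names are hypothetical, standing in for separately compiled SSE and AVX builds of the kernel):

// Sketch: runtime dispatch using the compiler's CPU-feature builtin
// instead of raw cpuid bit tests. The two callee names are placeholders.
void addition_tab_sse(int size, double *a, double *b, double *c);
void addition_tab_avx(int size, double *a, double *b, double *c);

void addition_tab_dispatch(int size, double *a, double *b, double *c)
{
    if (__builtin_cpu_supports("avx2"))
        addition_tab_avx(size, a, b, c);   // 256-bit path
    else
        addition_tab_sse(size, a, b, c);   // 128-bit fallback
}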
BTW if you are using gcc and specifying -mavx or -msse2, the compiler defines the builtin macros __AVX__ or __SSE2__ for you; no need for -DAVX256. To support archaic 32-bit processors, -m32 unfortunately disables __SSE2__ hence effectively disables #include <emmintrin.h> :-P
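For example, the manual -DSSE / -DAVX256 switches could be replaced by checks on those predefined macros; a sketch reusing the ALIGN constant from the question:

// Sketch: pick a code path at compile time from the macros the compiler
// defines for -msse2 / -mavx, instead of passing -DSSE / -DAVX256.
#if defined(__AVX__)
    #define ALIGN 32    // 256-bit path
#elif defined(__SSE2__)
    #define ALIGN 16    // 128-bit path
#else
    #define ALIGN 8     // scalar fallback
#endif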
HTH
I tried compiling this code,
#include <stdlib.h>

struct rgb {
    int r, g, b;
};

void adjust_brightness(struct rgb *picdata, size_t len, int adjustment) {
    // assume adjustment is between 0 and 255.
    for (int i = 0; i < len; i++) {
        picdata[i].r += adjustment;
        picdata[i].g += adjustment;
        picdata[i].b += adjustment;
    }
}
on OSX using this command,
$ cc -Rpass-analysis=loop-vectorize -c -std=c99 -O3 brightness.c
brightness.c:13:3: remark: loop not vectorized: unsafe dependent memory operations in loop [-Rpass-analysis=loop-vectorize]
for (int i = 0; i < len; i++) {
^
Can someone explain what is unsafe and dependent here? I'm learning about SIMD, and this was presented as the most obvious use case for SIMD. I was hoping to learn how the compiler would generate SIMD instructions for a simple example. In my head, I expect that instead of incrementing by 1, the compiler would increment by enough to fit the loop body into vector registers.
Do I misunderstand?
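That intuition is roughly right. For illustration only (this does not explain the diagnostic): if struct rgb has no padding, the loop is equivalent to adding a constant to a flat array of 3*len ints, which is exactly the shape a vectorizer targets. A hand-written SSE2 sketch of that view, processing 4 ints per step:

#include <emmintrin.h>
#include <stddef.h>

struct rgb { int r, g, b; };

// Sketch: treat the pixel array as 3*len consecutive ints (assumes the
// struct has no padding), then add the constant 4 ints at a time.
void adjust_brightness_sse2(struct rgb *picdata, size_t len, int adjustment)
{
    int *p = &picdata[0].r;          // flat view of the pixel data
    size_t n = 3 * len;
    __m128i adj = _mm_set1_epi32(adjustment);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)(p + i));
        _mm_storeu_si128((__m128i *)(p + i), _mm_add_epi32(v, adj));
    }
    for (; i < n; ++i)               // leftover elements
        p[i] += adjustment;
}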
Question
I am testing a simple code which calculates Mandelbrot fractal. I have been checking its performance depending on the number of iterations in the function that checks if a point belongs to the Mandelbrot set or not.
The surprising thing is that I am getting a big difference in times after adding the -fPIC flag. From what I read, the overhead is usually negligible, and the highest overhead I came across was about 6%. I measured around 30% overhead. Any advice will be appreciated!
Details of my project
I use the -O3 flag, gcc 4.7.2, Ubuntu 12.04.2, x86_64.
The results look as follows:
#iter C (fPIC) C C/C(fPIC)
1 0.01 0.01 1.00
100 0.04 0.03 0.75
200 0.06 0.04 0.67
500 0.15 0.1 0.67
1000 0.28 0.19 0.68
2000 0.56 0.37 0.66
4000 1.11 0.72 0.65
8000 2.21 1.47 0.67
16000 4.42 2.88 0.65
32000 8.8 5.77 0.66
64000 17.6 11.53 0.66
Commands I use:
gcc -O3 -fPIC fractalMain.c fractal.c -o ffpic
gcc -O3 fractalMain.c fractal.c -o f
Code: fractalMain.c
#include <time.h>
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"

int main()
{
    int iterNumber[] = {1, 100, 200, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000};
    int it;
    for(it = 0; it < 11; ++it)
    {
        clock_t start = clock();
        fractal(iterNumber[it]);
        clock_t end = clock();
        double millis = (end - start)*1000 / CLOCKS_PER_SEC/(double)1000;
        printf("Iter: %d, time: %lf \n", iterNumber[it], millis);
    }
    return 0;
}
Code: fractal.h
#ifndef FRACTAL_H
#define FRACTAL_H
void fractal(int iter);
#endif
Code: fractal.c
#include <stdio.h>
#include <stdbool.h>
#include "fractal.h"

void multiplyComplex(double a_re, double a_im, double b_re, double b_im, double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}

void sqComplex(double a_re, double a_im, double* res_re, double* res_im)
{
    multiplyComplex(a_re, a_im, a_re, a_im, res_re, res_im);
}

bool isInSet(double P_re, double P_im, double C_re, double C_im, int iter)
{
    double zPrev_re = P_re;
    double zPrev_im = P_im;
    double zNext_re = 0;
    double zNext_im = 0;
    double* p_zNext_re = &zNext_re;
    double* p_zNext_im = &zNext_im;
    int i;
    for(i = 1; i <= iter; ++i)
    {
        sqComplex(zPrev_re, zPrev_im, p_zNext_re, p_zNext_im);
        zNext_re = zNext_re + C_re;
        zNext_im = zNext_im + C_im;
        if(zNext_re*zNext_re+zNext_im*zNext_im > 4)
        {
            return false;
        }
        zPrev_re = zNext_re;
        zPrev_im = zNext_im;
    }
    return true;
}

bool isMandelbrot(double P_re, double P_im, int iter)
{
    return isInSet(0, 0, P_re, P_im, iter);
}

void fractal(int iter)
{
    int noIterations = iter;
    double xMin = -1.8;
    double xMax = 1.6;
    double yMin = -1.3;
    double yMax = 0.8;
    int xDim = 512;
    int yDim = 384;
    double P_re, P_im;
    int nop;
    int x, y;
    for(x = 0; x < xDim; ++x)
        for(y = 0; y < yDim; ++y)
        {
            P_re = (double)x*(xMax-xMin)/(double)xDim+xMin;
            P_im = (double)y*(yMax-yMin)/(double)yDim+yMin;
            if(isMandelbrot(P_re, P_im, noIterations))
                nop = x+y;
        }
    printf("%d", nop);
}
Story behind the comparison
It might look a bit artificial to add the -fPIC flag when building an executable (as per one of the comments). So a few words of explanation: first I only compiled the program as an executable and wanted to compare it to my Lua code, which calls the isMandelbrot function from C. So I created a shared object to call it from Lua, and saw big time differences. But I couldn't understand why they were growing with the number of iterations. In the end I found out that it was because of -fPIC. When I create a little C program which calls my Lua script (so effectively I do the same thing, only without needing the .so), the times are very similar to C (without -fPIC). So I have checked it in a few configurations over the last few days, and it consistently shows two sets of very similar results: faster without -fPIC and slower with it.
It turns out that when you compile without the -fPIC option, multiplyComplex, sqComplex, isInSet and isMandelbrot are inlined automatically by the compiler. If you define those functions as static, you will likely get the same performance when compiling with -fPIC, because the compiler will be free to perform inlining.
The reason why the compiler is unable to automatically inline the helper functions has to do with symbol interposition. Position independent code is required to access all global data indirectly, i.e. through the global offset table. The very same constraint applies to function calls, which have to go through the procedure linkage table. Since a symbol might get interposed by another one at runtime (see LD_PRELOAD), the compiler cannot simply assume that it is safe to inline a function with global visibility.
The very same assumption can be made if you compile without -fPIC, i.e. the compiler can safely assume that a global symbol defined in the executable cannot be interposed because the lookup scope begins with the executable itself which is then followed by all other libraries, including the preloaded ones.
For a more thorough understanding have a look at the following paper.
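A minimal sketch of that fix, shown for one helper (the same static treatment applies to sqComplex, isInSet and isMandelbrot):

// Sketch: internal linkage means the symbol cannot be interposed at
// runtime, so the compiler may inline it even when building with -fPIC.
static void multiplyComplex(double a_re, double a_im,
                            double b_re, double b_im,
                            double* res_re, double* res_im)
{
    *res_re = a_re*b_re - a_im*b_im;
    *res_im = a_re*b_im + a_im*b_re;
}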
As other people have already pointed out, -fPIC forces GCC to disable many optimizations, e.g. inlining and cloning. I'd like to point out several ways to overcome this:
replace -fPIC with -fPIE if you are compiling the main executable (not libraries), as this allows the compiler to assume that interposition is not possible;
use -fvisibility=hidden and __attribute__((visibility("default"))) to export only the necessary functions from the library and hide the rest; this allows GCC to optimize hidden functions more aggressively (see the sketch below);
use private symbol aliases (__attribute__((alias ("__f")));) to refer to library functions from within the library; this would again untie GCC's hands;
the previous suggestion can be automated with the -fno-semantic-interposition flag that was added in recent GCC versions.
It's interesting to note that Clang is different from GCC as it allows all optimizations by default regardless of -fPIC (can be overridden with -fsemantic-interposition to obtain GCC-like behavior).
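A sketch of the visibility approach mentioned above (the function names are made up; the library is compiled with -fvisibility=hidden and only the intended entry points are marked):

// Sketch: with -fvisibility=hidden, everything defaults to hidden and
// cannot be interposed, so GCC may inline and clone freely under -fPIC;
// only explicitly marked functions are exported.
#define EXPORT __attribute__((visibility("default")))

static double helper(double x) { return x * x; }   // internal, freely inlined

EXPORT double library_entry(double x)              // the only exported symbol
{
    return helper(x) + 1.0;
}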
As others have discussed in the comment section of your opening post, compiling with -flto should help to reduce the difference in run-times you are seeing for this particular case, since the link time optimisations of gcc will likely figure out that it's actually ok to inline a couple of functions ;)
In general, link time optimisations can lead to significant reductions in code size (~6%; see the paper on link time optimisations in gold), and thus run time as well (more of your program fits in the cache). Also note that -fPIC is mostly viewed as a feature that enables tighter security and is always enabled in Android. This question on SO briefly discusses it as well. Also, just to let you know, -fpic is the faster version of -fPIC, so if you must use -fPIC try -fpic instead (see the gcc docs). For x86 it might not make a difference, but you need to check this for yourself or ask on gcc-help.