Branchless conditionals on integers: fast, but can they be made faster?

I've been experimenting with the following and have noticed that the branchless “if” defined here (now with &-!! replacing *!!) can speed up certain bottleneck code by as much as (almost) 2x on 64-bit Intel targets with clang:
// Produces x if f is true, else 0 if f is false.
#define BRANCHLESS_IF(f,x) ((x) & -((typeof(x))!!(f)))
// Produces x if f is true, else y if f is false.
#define BRANCHLESS_IF_ELSE(f,x,y) (((x) & -((typeof(x))!!(f))) | \
((y) & -((typeof(y)) !(f))))
Note that f should be a reasonably simple expression with no side-effects, so that the compiler is able to do its best optimizations.
Performance is highly dependent on CPU and compiler. The branchless ‘if’ performance is excellent with clang; I haven't found any cases yet where the branchless ‘if/else’ is faster, though.
My question is: are these safe and portable as written (meaning guaranteed to give correct results on all targets), and can they be made faster?
Example usage of branchless if/else
These compute 64-bit minimum and maximum.
inline uint64_t uint64_min(uint64_t a, uint64_t b)
{
    return BRANCHLESS_IF_ELSE((a <= b), a, b);
}
inline uint64_t uint64_max(uint64_t a, uint64_t b)
{
    return BRANCHLESS_IF_ELSE((a >= b), a, b);
}
Example usage of branchless if
This is 64-bit modular addition — it computes (a + b) % n. The branching version (not shown) suffers terribly from branch prediction failures, but the branchless version is very fast (at least with clang).
inline uint64_t uint64_add_mod(uint64_t a, uint64_t b, uint64_t n)
{
    assert(n > 1); assert(a < n); assert(b < n);
    uint64_t c = a + b - BRANCHLESS_IF((a >= n - b), n);
    assert(c < n);
    return c;
}
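To make the wrap-around argument behind the modular addition easier to check, here is a minimal sketch that puts the branchless version next to an ordinary branching one. The macro is the one from the question, except that it is spelled with __typeof__, which is an assumption about the compiler: typeof is a GNU extension before C23, and GCC and clang accept the __typeof__ spelling. The name add_mod_branching is hypothetical and only serves as a reference.

```c
#include <assert.h>
#include <stdint.h>

/* BRANCHLESS_IF from the question, spelled with __typeof__ (assumption:
   a GCC- or clang-compatible compiler). */
#define BRANCHLESS_IF(f,x) ((x) & -((__typeof__(x))!!(f)))

/* Branchless (a + b) % n; requires n > 1, a < n, b < n. */
static uint64_t add_mod_branchless(uint64_t a, uint64_t b, uint64_t n)
{
    assert(n > 1 && a < n && b < n);
    /* a + b may wrap past 2^64, but only when n is subtracted, which wraps it back. */
    return a + b - BRANCHLESS_IF((a >= n - b), n);
}

/* Hypothetical reference with an ordinary branch, for comparison only. */
static uint64_t add_mod_branching(uint64_t a, uint64_t b, uint64_t n)
{
    assert(n > 1 && a < n && b < n);
    return (a >= n - b) ? a - (n - b) : a + b;
}
```

Both functions should agree for every valid input, including moduli close to 2^64 where a + b overflows before n is subtracted.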
Update: Full concrete working example of branchless if
Below is a full working C11 program that demonstrates the speed difference between the branching and branchless versions of a simple if conditional, if you would like to try it on your system. The program computes modular exponentiation, that is (a ** b) % n, for extremely large values.
To compile, use the following on the command line:
-O3 (or whatever high optimization level you prefer)
-DNDEBUG (to disable assertions, for speed)
Either -DBRANCHLESS=0 or -DBRANCHLESS=1 to specify branching or branchless behavior, respectively
On my system, here's what happens:
$ cc -DBRANCHLESS=0 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
CPU time: 21.83 seconds
foo = 10585369126512366091
$ cc -DBRANCHLESS=1 -DNDEBUG -O3 -o powmod powmod.c && ./powmod
CPU time: 11.76 seconds
foo = 10585369126512366091
$ cc --version
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin14.1.0
Thread model: posix
So, the branchless version is almost twice as fast as the branching version on my system (3.4 GHz Intel Core i7).
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>
#include <assert.h>
typedef uint64_t uint64;
#if BRANCHLESS
// Actually branchless.
#define BRANCHLESS_IF(f,x) ((x) & -((typeof(x))!!(f)))
#define BRANCHLESS_IF_ELSE(f,x,y) (((x) & -((typeof(x))!!(f))) | \
((y) & -((typeof(y)) !(f))))
#else
// Not actually branchless, but used for comparison.
#define BRANCHLESS_IF(f,x) ((f)? (x) : 0)
#define BRANCHLESS_IF_ELSE(f,x,y) ((f)? (x) : (y))
#endif
// 64-bit modular multiplication. Computes (a * b) % n without division.
static uint64 uint64_mul_mod(uint64 a, uint64 b, const uint64 n)
{
    assert(n > 1); assert(a < n); assert(b < n);
    if (a < b) { uint64 t = a; a = b; b = t; } // Ensure that b <= a.
    uint64 c = 0;
    for (; b != 0; b /= 2)
    {
        // This computes c = (c + a) % n if (b & 1).
        c += BRANCHLESS_IF((b & 1), a - BRANCHLESS_IF((c >= n - a), n));
        assert(c < n);
        // This computes a = (a + a) % n.
        a += a - BRANCHLESS_IF((a >= n - a), n);
        assert(a < n);
    }
    assert(c < n);
    return c;
}
// 64-bit modular exponentiation. Computes (a ** b) % n using modular
// multiplication.
uint64 uint64_pow_mod(uint64 a, uint64 b, const uint64 n)
{
    assert(n > 1); assert(a < n);
    uint64 c = 1;
    for (; b > 0; b /= 2)
    {
        if (b & 1)
            c = uint64_mul_mod(c, a, n);
        a = uint64_mul_mod(a, a, n);
    }
    assert(c < n);
    return c;
}
int main(const int argc, const char *const argv[const])
{
    printf("BRANCHLESS = %d\n", BRANCHLESS);
    clock_t clock_start = clock();
    #define SHOW_RESULTS 0
    uint64 foo = 0; // Used in forcing compiler not to throw away results.
    uint64 n = 3, a = 1, b = 1;
    const uint64 iterations = 1000000;
    for (uint64 iteration = 0; iteration < iterations; iteration++)
    {
        uint64 c = uint64_pow_mod(a%n, b, n);
        if (SHOW_RESULTS)
        {
            printf("(%"PRIu64" ** %"PRIu64") %% %"PRIu64" = %"PRIu64"\n",
                   a%n, b, n, c);
        }
        foo ^= c;
        n = n * 3 + 1;
        a = a * 5 + 3;
        b = b * 7 + 5;
    }
    clock_t clock_end = clock();
    double elapsed = (double)(clock_end - clock_start) / CLOCKS_PER_SEC;
    printf("CPU time: %.2f seconds\n", elapsed);
    printf("foo = %"PRIu64"\n", foo);
    return 0;
}
Second update: Intel vs. ARM performance
Testing on 32-bit ARM targets (iPhone 3GS/4S, iPad 1/2/3/4, as compiled by Xcode 6.1 with clang) reveals that the branchless “if” here is actually about 2–3 times slower than ternary ?: for the modular exponentiation code in those cases. So it seems that these branchless macros are not a good idea if maximum speed is needed, although they might be useful in rare cases where constant speed is needed.
On 64-bit ARM targets (iPhone 6+, iPad 5), the branchless “if” runs the same speed as ternary ?: — again as compiled by Xcode 6.1 with clang.
For both Intel and ARM (as compiled by clang), the branchless “if/else” was about twice as slow as ternary ?: for computing min/max.

Sure, this is portable: the ! operator is guaranteed to give either 0 or 1 as a result. This is then converted to whatever type is needed by the other operand.
As others observed, your if-else version has the disadvantage of evaluating its arguments twice, but you already know that, and if there are no side effects you are fine.
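As a small illustration of the 0-or-1 guarantee, the sketch below turns a flag into an all-zeros or all-ones mask and uses that one mask (plus its complement) to select between two values. The function names mask_from_flag and select_u64 are hypothetical, and the code is fixed to uint64_t instead of using typeof. Being functions rather than macros, they also evaluate each argument exactly once.

```c
#include <assert.h>
#include <stdint.h>

/* Maps any nonzero flag to UINT64_MAX and zero to 0. */
static uint64_t mask_from_flag(int f)
{
    uint64_t bit = (uint64_t)!!f;   /* !! normalizes f to exactly 0 or 1 */
    assert(bit == 0 || bit == 1);
    return -bit;                    /* unsigned negation wraps: 1 -> all ones */
}

/* Returns x if f is nonzero, else y, using a single mask. */
static uint64_t select_u64(int f, uint64_t x, uint64_t y)
{
    uint64_t m = mask_from_flag(f);
    return (x & m) | (y & ~m);
}
```

Whether this is any faster than ?: is a separate question that only a measurement on the target compiler and CPU can answer.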
What surprises me is that you say that this is faster. I would have thought that modern compilers perform that sort of optimization themselves.
Edit: So I tested this with two compilers (gcc and clang) and the two values for the configuration.
In fact, if you don't forget to set -DNDEBUG=1, the 0 version with ?: is much better for gcc and does what I would have expected it to do. It basically uses conditional moves to make the loop branchless. In that case clang doesn't find this sort of optimization and emits some conditional jumps.
For the version with arithmetic, the performance for gcc worsens. In fact, seeing what it generates, this is not surprising: it really uses imul instructions, and these are slow. clang fares better here: for the "arithmetic" version it has actually optimized the multiplications out and replaced them with conditional moves.
So to summarize: yes, this is portable, but whether it improves or worsens performance will depend on your compiler, its version, the compile flags that you are applying, the capabilities of your processor ...


How to use AVX instructions to optimize ReLU written in C

I have the following simplified ReLU simulation code that I am trying to optimize. The code uses a ternary operation which is perhaps coming in the way of automatic vectorization by the compiler. How can I vectorize this code?
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>
void relu(double* &Amem, double* Z_curr, int bo)
{
    for (int i=0; i<bo; ++i) {
        Amem[i] = Z_curr[i] > 0 ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
int main()
{
    int i, j;
    int batch_size = 16384;
    int output_dim = 21;
    // double* Amem = new double[batch_size*output_dim];
    // double* Z_curr = new double[batch_size*output_dim];
    double* Amem = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    double* Z_curr = (double *)mkl_malloc(batch_size*output_dim*sizeof( double ), 64 );
    memset(Amem, 0, sizeof(double)*batch_size*output_dim);
    for (i=0; i<batch_size*output_dim; ++i) {
        Z_curr[i] = -1+2*((double)rand())/RAND_MAX;
    }
    relu(Amem, Z_curr, batch_size*output_dim);
}
To compile it, if you have MKL then use the following, otherwise plain g++ -O3.
g++ -O3 ex.cxx -L${MKLROOT}/lib/intel64 -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5
So far, I have tried adding -march=skylake-avx512 as a compiler option, but it does not vectorize the loop as I found using option -fopt-info-vec-all for compilation:
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
ex.cxx:6:6: note: vectorized 0 loops in function.
ex.cxx:9:16: missed: couldn't vectorize loop
ex.cxx:9:16: missed: not vectorized: control flow in loop.
and this is the time it takes currently at my end:
time ./a.out
real 0m0.034s
user 0m0.026s
sys 0m0.009s
There is usually no benefit to pass a pointer by reference (unless you want to modify the pointer itself). Furthermore, you can help your compiler using the (non-standard) __restrict keyword, telling it that no aliasing happens between input and output (of course, this will likely give wrong results, if e.g., Amem == Z_curr+1 -- but Amem == Z_curr should (in this case) be fine).
void relu(double* __restrict Amem, double* Z_curr, int bo)
Using that alone, clang is actually capable of vectorizing your loop using vcmpltpd and masked moves (for some reason, only using 256-bit registers).
If you simplify your expression to std::max(Z_curr[i], 0.1*Z_curr[i]), even gcc is easily capable of vectorizing it.
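For readers who want the same idea in plain C, here is a minimal sketch of the max formulation. It is an illustration under assumptions, not the exact code that was tested above: the name relu_max is hypothetical, C99 restrict stands in for __restrict, and a ternary that picks the larger of z and 0.1*z stands in for std::max.

```c
#include <assert.h>
#include <stddef.h>

/* Leaky ReLU written as max(z, 0.1*z). 'restrict' promises that the
   input and output arrays do not overlap. */
static void relu_max(double *restrict out, const double *restrict in, size_t n)
{
    assert(out != NULL && in != NULL);
    for (size_t i = 0; i < n; ++i) {
        double z = in[i];
        double leak = 0.1 * z;
        /* For z > 0 the larger value is z, otherwise it is 0.1*z. */
        out[i] = (z > leak) ? z : leak;
    }
}
```

For finite inputs this gives the same values as the original ternary, because z > 0.1*z holds exactly when z > 0.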
Generally, I would suggest compiling crucial routines of your code with different compilers and different compile options (sometimes trying -ffast-math can show you ways to simplify your expressions) and have a look at the generated code. For portability you could then re-translate the generated code into intrinsics (or leave it as is, if every compiler you care about gives good-enough results).
For completeness, here is a possible manually vectorized implementation using intrinsics:
void relu_avx512(double* __restrict Amem, double* Z_curr, int bo)
{
    int i;
    // main loop: 8 doubles per 512-bit vector (needs <immintrin.h>)
    for (i=0; i<=bo-8; i+=8)
    {
        __m512d z = _mm512_loadu_pd(Z_curr+i);
        // mask of the lanes with z <= 0; only those get scaled by 0.1
        __mmask8 non_positive = _mm512_cmple_pd_mask(z, _mm512_setzero_pd());
        __m512d res = _mm512_mask_mul_pd(z, non_positive, z, _mm512_set1_pd(0.1));
        _mm512_storeu_pd(Amem+i, res);
    }
    // remaining elements in scalar loop
    for (; i<bo; ++i) {
        Amem[i] = 0.0 < Z_curr[i] ? Z_curr[i] : Z_curr[i]*0.1;
    }
}
Note that if you compile this with -O2 or -O3 on clang, it will heavily unroll the cleanup loop, even though it can't have more than 7 iterations. Theoretically, you could handle the remaining elements with a masked or overlapping store (or maybe you have use cases where the size is guaranteed to be a multiple of 8 and you can leave the cleanup loop out).

Why does this code beat rint() and how do I protect it from -ffast-math and friends?

I am looking to find a way to protect some code work from -ffast-math (or msvc/icc equivalents, etc) that works across C compilers.
My inner loop is searching data for numbers that are close to integer values (e.g within ~0.1). Data values are signed, typically less than a few thousand with no inf/nan. The fastest version I found uses a trick with a large magic number:
remainder = h - ((h+MAGIC)-MAGIC) ;
Does someone have ideas for a way to keep the priority order of the brackets for the key line above? This seems to beat rint(x) by a factor of 3, so I am kind of curious as to why it is working anyway. Could it be something to do with vectorisation?
Most compilers "simplify" the expression when using -ffast-math or equivalent and it stops working. I want to keep the performance (3X is quite a lot) but also keep it vaguely portable (given MAGIC depends on having the right ieee). If I add a volatile then it slows down but seems to give right answers with fast-math but is then slower than rint:
volatile double t = h+MAGIC; t-=MAGIC;
remainder = h - t;
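As to why the trick rounds at all: MAGIC is 1.5 * 2^52, and doubles of that magnitude are spaced exactly 1 apart, so the addition itself has to round h to a whole number, and the subtraction then recovers it exactly. The sketch below shows this in isolation. It assumes IEEE 754 doubles, the default round-to-nearest mode (so ties go to the even integer), and values far smaller than 2^51. The name round_magic is hypothetical.

```c
#include <assert.h>

/* 1.5 * 2^52: doubles near this value have a spacing (ulp) of exactly 1. */
#define MAGIC 6755399441055744.0

/* Rounds h to the nearest integer by letting the addition do the rounding. */
static double round_magic(double h)
{
    assert(h > -1e9 && h < 1e9);    /* sketch only covers "small" values */
    volatile double t = h + MAGIC;  /* volatile keeps the two steps separate */
    t -= MAGIC;
    return t;
}
```

The remainder from the question is then h - round_magic(h). Because the rounding happens inside an ordinary add, the non-volatile form can be vectorized like any other arithmetic, which is a plausible reason for the speed difference.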
A complete example is below. I tried some gcc things like __attribute__((optimize("-fno-associative-math"))) but it doesn't seem like the right approach to eventually work for icc/gcc/msvc/clang etc. The related C99 standard pragmas don't seem to be widely available either.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <sys/time.h>
/* */
union i_cast {double d; int i[2];};
#define MAGIC 6755399441055744.0
/* x86_64 for me today */
#define ENDIANLOC 0
#define REMAINDER_LUA \
{volatile union i_cast u; u.d = (h) + MAGIC; r = h - u.i[ENDIANLOC]; }
#define REMAINDER_MAGIC r=(h - ((h+MAGIC)-MAGIC));
#define REMAINDER_RINT r=(h - rint(h));
#define REMAINDER_TRUNC r=(h - ( (h>0) ? ((int)(h+0.5)) : ((int)(h-0.5))) );
#define REMAINDER_FLOOR r=(h - floor(h+0.5));
#define REMAINDER_REMAIN r=(remainder(h, 1.0));
#define REMAINDER_ROUND r=(h - round(h));
#define REMAINDER_NEARBY r=(h - nearbyint(h));
#define block(MACRO) { \
    for(i=0 ; i<3 ; i++){ \
        gettimeofday(&start, NULL); \
        n = 0; \
        for (k = 0; k < ng; k++) { \
            h = mul * gv[k]; \
            MACRO \
            if ( (r*r) < tol ) n++; \
        } \
        gettimeofday(&end, NULL); \
        double dt = (double)(end.tv_sec - start.tv_sec); \
        dt += (1e-6)*(double)(end.tv_usec - start.tv_usec); \
        if(i==2) \
            printf("%20s %d indexed in %lf s %f ns/value\n",#MACRO, \
                   n,dt,1e9*dt/ng); \
    } \
}
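int main(){
    struct timeval start, end;
    // Make some test data
    double h, r, tol = 0.02, mul = 123.4;
    int i, n, k, ng = 1024*1024*32;
    double *gv = (double *) malloc(ng*sizeof(double));
    for(int i=0;i<ng;i++) { gv[i] = ((double)rand())/RAND_MAX*2.-1.; }
    // Measure some timing (one block per macro; order assumed from the output below)
    block( REMAINDER_MAGIC );
    block( REMAINDER_LUA );
    block( REMAINDER_RINT );
    block( REMAINDER_FLOOR );
    block( REMAINDER_TRUNC );
    block( REMAINDER_ROUND );
    block( REMAINDER_REMAIN );
    block( REMAINDER_NEARBY );
    free( gv );
    return 0;
}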
For me, today, the output was like this with gcc -O3:
REMAINDER_MAGIC 9489537 indexed in 0.017953 s 0.535041 ns/value
REMAINDER_LUA 9489537 indexed in 0.048870 s 1.456439 ns/value
REMAINDER_RINT 9489537 indexed in 0.050894 s 1.516759 ns/value
REMAINDER_FLOOR 9489537 indexed in 0.086768 s 2.585888 ns/value
REMAINDER_TRUNC 9489537 indexed in 0.162564 s 4.844785 ns/value
REMAINDER_ROUND 9489537 indexed in 0.417856 s 12.453079 ns/value
REMAINDER_REMAIN 9489537 indexed in 0.517612 s 15.426040 ns/value
REMAINDER_NEARBY 9489537 indexed in 0.786896 s 23.451328 ns/value
Perhaps some other language (rust/go/opencl/whatever) would do better than C here? Or it is just better to control the compiler flags and add a runtime test in the code for correctness?
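A runtime test of the kind mentioned above could look roughly like the following. This is only a sketch under assumptions: the function name magic_rounding_works is hypothetical, the sample values are arbitrary, and the inputs are read through a volatile so the compiler cannot fold the check at compile time. If an option such as -ffast-math simplifies (x + MAGIC) - MAGIC to x, the comparison fails and the caller can fall back to rint().

```c
#include <assert.h>

#define MAGIC 6755399441055744.0

/* Returns 1 if (x + MAGIC) - MAGIC still rounds to the nearest integer
   in this build, 0 if the compiler has simplified it away. */
static int magic_rounding_works(void)
{
    static const double input[]    = { 0.25, 0.75, -0.75, 1234.4, -1234.6 };
    static const double expected[] = { 0.0,  1.0,  -1.0,  1234.0, -1235.0 };
    const int count = (int)(sizeof input / sizeof input[0]);
    assert(sizeof input == sizeof expected);
    for (int i = 0; i < count; i++) {
        volatile double h = input[i];   /* volatile: defeat constant folding */
        double x = h;
        double rounded = (x + MAGIC) - MAGIC;
        if (rounded != expected[i])
            return 0;
    }
    return 1;
}
```

Such a check would have to live in the same translation unit, compiled with the same flags, as the inner loop it is meant to guard.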
There's no standard way to control this nonstandard behavior. Every compiler with a -ffast-math-style option has attributes and pragmas to control that, but those differ, as do the precise effects of the option. For that matter, different versions of some of these compilers have remarkably different fast-math behavior, so it's not merely a matter of a collection of appropriate pragmas. The standard way to get standard behavior is to make the compiler follow the language standard.
-ffast-math and its ilk are intended primarily for programmers who don't care about the details of floating point math, and just want their programs (which make limited and conservative use of FP) to run faster. Most of the useful effects of -ffast-math can be duplicated with carefully written code in any case.

256-bit vectorization via OpenMP SIMD prevents compiler's optimization (say function inlining)?

Consider the following toy example, where A is an n x 2 matrix stored in column-major order and I want to compute its column sum. sum_0 only computes sum of the 1st column, while sum_1 does the 2nd column as well. This is really an artificial example, as there is essentially no need to define two functions for this task (I can write a single function with a double loop nest where the outer loop iterates from 0 to j). It is constructed to demonstrate the template problem I have in reality.
/* "test.c" */
#include <stdlib.h>
// j can be 0 or 1
static inline void sum_template (size_t j, size_t n, double *A, double *c) {
    if (n == 0) return;
    size_t i;
    double *a = A, *b = A + n;
    double c0 = 0.0, c1 = 0.0;
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
    for (i = 0; i < n; i++) {
        c0 += a[i];
        if (j > 0) c1 += b[i];
    }
    c[0] = c0;
    if (j > 0) c[1] = c1;
}
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
    sum_template(j, n, A, c); \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
If I compile it with
gcc -O2 -mavx test.c
GCC (say the latest 8.2), after inlining, constant propagation and dead code elimination, would optimize out code involving c1 for function sum_0 (Check it on Godbolt).
I like this trick. By writing a single template function and passing in different configuration parameters, an optimizing compiler can generate different versions. It is much cleaner than copying-and-pasting a big proportion of the code and manually define different function versions.
However, such convenience is lost if I activate OpenMP 4.0+ with
gcc -O2 -mavx -fopenmp test.c
sum_template is inlined no more and no dead code elimination is applied (Check it on Godbolt). But if I remove flag -mavx to work with 128-bit SIMD, compiler optimization works as I expect (Check it on Godbolt). So is this a bug? I am on an x86-64 (Sandybridge).
Using GCC's auto-vectorization -ftree-vectorize -ffast-math would not have this issue (Check it on Godbolt). But I wish to use OpenMP because it allows portable alignment pragma across different compilers.
I write modules for an R package, which needs be portable across platforms and compilers. Writing R extension requires no Makefile. When R is built on a platform, it knows what the default compiler is on that platform, and configures a set of default compilation flags. R does not have auto-vectorization flag but it has OpenMP flag. This means that using OpenMP SIMD is the ideal way to utilize SIMD in an R package. See 1 and 2 for a bit more elaboration.
The simplest way to solve this problem is with __attribute__((always_inline)), or other compiler-specific overrides.
#ifdef __GNUC__
#define ALWAYS_INLINE __attribute__((always_inline)) inline
#elif defined(_MSC_VER)
#define ALWAYS_INLINE __forceinline inline
#else
#define ALWAYS_INLINE inline // cross your fingers
#endif
static ALWAYS_INLINE void sum_template (size_t j, size_t n, double *A, double *c) {
    // ... body unchanged from the question ...
}
Godbolt proof that it works.
Also, don't forget to use -mtune=haswell, not just -mavx. It's usually a good idea. (However, promising aligned data will stop gcc's default -mavx256-split-unaligned-load tuning from splitting 256-bit loads into 128-bit vmovupd + vinsertf128, so code gen for this function is fine even without tune=haswell. But normally you want this for gcc to auto-vectorize any other functions.)
You don't really need static along with inline; if a compiler decides not to inline it, it can at least share the same definition across compilation units.
Normally gcc decides to inline or not according to function-size heuristics. But even setting -finline-limit=90000 doesn't get gcc to inline with your #pragma omp (How do I force gcc to inline a function?). I had been guessing that gcc didn't realize that constant-propagation after inlining would simplify the conditional, but 90000 "pseudo-instructions" seems plenty big. There could be other heuristics.
Possibly OpenMP sets some per-function stuff differently in ways that could break the optimizer if it let them inline into other functions. Using __attribute__((target("avx"))) stops that function from inlining into functions compiled without AVX (so you can do runtime dispatching safely, without inlining "infecting" other functions with AVX instructions across if(avx) conditions.)
One thing OpenMP does that you don't get with regular auto-vectorization is that reductions can be vectorized without enabling -ffast-math.
Unfortunately OpenMP still doesn't bother to unroll with multiple accumulators or anything to hide FP latency. #pragma omp is a pretty good hint that a loop is actually hot and worth spending code-size on, so gcc should really do that, even without -fprofile-use.
So especially if this ever runs on data that's hot in L2 or L1 cache (or maybe L3), you should do something to get better throughput.
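To show what "multiple accumulators" means, here is a minimal scalar sketch. It is not what gcc emits for the OpenMP loop: the name sum_4acc and the choice of four accumulators are assumptions made for illustration. The point is that the four additions in the loop body do not depend on each other, so they can overlap instead of each waiting for the previous add to finish.

```c
#include <assert.h>
#include <stddef.h>

/* Sums n doubles using four independent dependency chains. */
static double sum_4acc(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)      /* leftover elements, at most 3 */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that this changes the order of the additions, so for general floating-point data the result can differ in the last bits from a strictly sequential sum. In a vectorized build the same idea applies with several vector accumulators instead of scalars.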
And BTW, alignment isn't usually a huge deal for AVX on Haswell. But 64-byte alignment does matter a lot more in practice for AVX512 on SKX. Like maybe 20% slowdown for misaligned data, instead of a couple %.
(But promising alignment at compile time is a separate issue from actually having your data aligned at runtime. Both are helpful, but promising alignment at compile time makes tighter code with gcc7 and earlier, or on any compiler without AVX.)
I desperately needed to resolve this issue, because in my real C project, if no template trick were used for auto generation of different function versions (simply called "versioning" hereafter), I would need to write a total of 1400 lines of code for 9 different versions, instead of just 200 lines for a single template.
I was able to find a way out, and am now posting a solution using the toy example in the question.
I planned to use an inline function sum_template for versioning. If successful, the versioning happens at compile time, when the compiler performs optimization. However, the OpenMP pragma turns out to defeat this compile-time versioning. The alternative is to do the versioning at the pre-processing stage, using macros only.
To get rid of the inline function sum_template, I manually inline it in the macro macro_define_sum:
#include <stdlib.h>
// j can be 0 or 1
#define macro_define_sum(FUN, j) \
void FUN (size_t n, double *A, double *c) { \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0; \
#pragma omp simd reduction (+: c0, c1) aligned (a, b: 32) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1; \
}
macro_define_sum(sum_0, 0)
macro_define_sum(sum_1, 1)
In this macro-only version, j is directly substituted by 0 or 1 during macro expansion. In the inline function + macro approach in the question, by contrast, I only have sum_template(0, n, A, c) or sum_template(1, n, A, c) at the pre-processing stage, and j in the body of sum_template is only propagated later, at compile time.
Unfortunately, the above macro gives an error. I cannot define or test a macro inside another (see 1, 2, 3). The OpenMP pragma starting with # is causing the problem here. So I have to split this template into two parts: the part before the pragma and the part after.
#include <stdlib.h>
#define macro_before_pragma \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0;
#define macro_after_pragma(j) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1;
void sum_0 (size_t n, double *A, double *c) {
    macro_before_pragma
    #pragma omp simd reduction (+: c0) aligned (a: 32)
    macro_after_pragma(0)
}
void sum_1 (size_t n, double *A, double *c) {
    macro_before_pragma
    #pragma omp simd reduction (+: c0, c1) aligned (a, b: 32)
    macro_after_pragma(1)
}
I no longer need macro_define_sum. I can define sum_0 and sum_1 straightaway using the two macros defined above. I can also adjust the pragma appropriately. Here, instead of having a template function, I have templates for code blocks of a function and can reuse them with ease.
The compiler output is as expected in this case (Check it on Godbolt).
Thanks for the various feedback; they are all very constructive (this is why I love Stack Overflow).
Thanks to Marc Glisse for pointing me to "Using an openmp pragma inside #define". Yeah, it was my bad not to have searched for this issue. #pragma is a directive, not a real macro, so there must be some way to put it inside a macro. Here is the neat version using the _Pragma operator:
/* "neat.c" */
#include <stdlib.h>
// stringizing:
#define str(s) #s
// j can be 0 or 1
#define macro_define_sum(j, alignment) \
void sum_ ## j (size_t n, double *A, double *c) { \
if (n == 0) return; \
size_t i; \
double *a = A, * b = A + n; \
double c0 = 0.0, c1 = 0.0; \
_Pragma(str(omp simd reduction (+: c0, c1) aligned (a, b: alignment))) \
for (i = 0; i < n; i++) { \
c0 += a[i]; \
if (j > 0) c1 += b[i]; \
} \
c[0] = c0; \
if (j > 0) c[1] = c1; \
}
macro_define_sum(0, 32)
macro_define_sum(1, 32)
Other changes include:
I used token concatenation to generate function name;
alignment is made a macro argument. For AVX, a value of 32 means good alignment, while a value of 8 (sizeof(double)) essentially implies no alignment. Stringizing is required to parse those tokens into strings that _Pragma requires.
Use gcc -E neat.c to inspect pre-processing result. Compilation gives desired assembly output (Check it on Godbolt).
A few comments on Peter Cordes' informative answer
Using compiler's function attributes. I am not a professional C programmer. My experience with C comes merely from writing R extensions. Because of that development environment, I am not very familiar with compiler attributes. I know some, but don't really use them.
-mavx256-split-unaligned-load is not an issue in my application, because I will allocate aligned memory and apply padding to ensure alignment. I just need to promise the compiler the alignment so that it can generate aligned load / store instructions. I do need to do some vectorization on unaligned data, but that contributes to a very limited part of the whole computation. Even if I get a performance penalty on split unaligned loads, it won't be noticed in reality. I also don't compile every C file with auto-vectorization. I only do SIMD when the operation is hot on L1 cache (i.e., it is CPU-bound not memory-bound). By the way, -mavx256-split-unaligned-load is for GCC; what is it for other compilers?
I am aware of the difference between static inline and inline. If an inline function is only accessed by one file, I will declare it as static so that compiler does not generate a copy of it.
OpenMP SIMD can do reduction efficiently even without GCC's -ffast-math. However, it does not use horizontal addition to aggregate results inside the accumulator register in the end of the reduction; it runs a scalar loop to add up each double word (see code block .L5 and .L27 in Godbolt output).
Throughput is a good point (especially for floating-point arithmetic, which has relatively high latency but high throughput). My real C code where SIMD is applied is a triple loop nest. I unroll the outer two loops to enlarge the code block in the innermost loop and so enhance throughput. Vectorization of the innermost one is then sufficient. With the toy example in this Q & A, where I just sum an array, I can use -funroll-loops to ask GCC for loop unrolling, using several accumulators to enhance throughput.
On this Q & A
I think most people would treat this Q & A in a more technical way than me. They might be interested in using compiler attributes or tweaking compiler flags / parameters to force function inlining. Therefore, Peter's answer as well as Marc's comment under the answer is still very valuable. Thanks again.

Weird C program behaviour

I have the following C program:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>
int main() {
const int opt_count = 2;
int oc = 30;
int c = 900;
printf("%d %f\n", c, pow(oc, opt_count));
assert(c == (int)(pow(oc, opt_count)));
return 0;
}
I'm running MinGW on Windows 8.1. Gcc version 4.9.3. I compile my program with:
gcc program.c -o program.exe
When I run it I get this output:
$ program
900 900.000000
Assertion failed: c == (int)(pow(oc, opt_count)), file program.c, line 16
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
What is going on? I expect the assertion to pass because 900 == 30^2.
I'm not using any fractions or decimals. I'm only using integers.
This happens when the implementation of pow is via
pow(x,y) = exp(log(x)*y)
Other library implementations first reduce the exponent by integer powers, thus avoiding this small floating point error.
More involved implementations contain steps like
pow(x,y) {
    if(y<0) return 1/pow(x, -y);
    n = (int)round(y);
    y = y-n;
    px = x; powxn = 1;
    while(n>0) {
        if(n%2==1) powxn *= px;
        n /=2; px *= px;
    }
    return powxn * exp(log(x)*y);
}
with the usual divide-and-conquer (halving-and-squaring) approach for the integer power powxn.
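When both the base and the exponent are integers, as in the question, the halving-and-squaring loop can be used on its own, with no floating point involved at all. Here is a minimal sketch. The names ipow and check_power are hypothetical, and overflow of the 64-bit result is not checked.

```c
#include <assert.h>
#include <stdint.h>

/* Integer power by repeated squaring; overflow is NOT checked. */
static int64_t ipow(int64_t base, unsigned exp)
{
    int64_t result = 1;
    while (exp > 0) {
        if (exp & 1u)           /* lowest remaining bit set: multiply in */
            result *= base;
        exp >>= 1;
        if (exp > 0)            /* skip the final, unused squaring */
            base *= base;
    }
    return result;
}

/* Mirrors the assertion from the question, but with exact integer math. */
void check_power(int oc, unsigned opt_count, int c)
{
    assert(c == ipow(oc, opt_count));
}
```

With this, check_power(30, 2, 900) passes regardless of how the C library implements pow.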
You have a nice answer (and solution) from @LutzL. Another solution is to compare the difference against an epsilon, e.g. 0.0001; that way you can keep using the standard pow function from math.h:
#define EPSILON 0.0001
#define EQ(a, b) (fabs((a) - (b)) < EPSILON)
assert(EQ((double)c, pow(oc, opt_count)));

static pre-calculation optimization in clang

I have
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <stdio.h>
#include <math.h>
int fib(int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}
double clock_now()
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)now.tv_sec + (double)now.tv_usec/1.0e6;
}
#define NITER 5
and in my main(), I'm doing a simple benchmark like this:
double t = clock_now();
int f = 0;
double tmin = INFINITY;
for (int i=0; i<NITER; ++i) {
    printf("run %i, %f\n", i, clock_now()-t);
    t = clock_now();
    f += fib(40);
    t = clock_now()-t;
    printf("%i %f\n", f, t);
    if (t < tmin) tmin = t;
    t = clock_now();
}
printf("fib,%.6f\n", tmin*1000);
When I compile with clang -O3 (LLVM 5.0 from Xcode 5.0.1), it always prints out zero time, except at the init of the for-loop, i.e. this:
run 0, 0.866536
102334155 0.000000
run 1, 0.000001
204668310 0.000000
run 2, 0.000000
307002465 0.000000
run 3, 0.000000
409336620 0.000000
run 4, 0.000001
511670775 0.000000
It seems that it statically precalculates the fib(40) and stores it somewhere. Right? The strange lag at the beginning (0.8 secs) is probably because it loads that cache?
I'm doing this for benchmarking. The C compiler should optimize fib() itself as much as it can. However, I don't want it to precalculate it already at compile time. So basically I want all code optimized as heavily as possible, but not main() (or at least not this specific optimization). Can I do that somehow?
What optimization is it anyway in this specific case? It's somehow strange and quite nice.
I found a solution by marking certain data volatile. Esp, what I did was:
volatile int f = 0;
volatile int FibArg = 40;
f += fib(FibArg);
That way, it forces the compiler to read FibArg when calling the function and it forces it to not assume that it is constant. Thus it must call the function to calculate it.
The volatile int f was not necessary for my compiler at the moment, but it might be in the future, when the compiler figures out that fib has no side effects and that neither its result nor f is ever used.
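Put together as a self-contained piece, the idea looks roughly like this. It is a minimal sketch: the wrapper name fib_opaque is hypothetical, and the range check is only there because fib(47) no longer fits in a 32-bit int.

```c
#include <assert.h>

static int fib(int n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

/* Passes the argument and the result through volatile objects, so the
   compiler can neither treat the argument as a known constant nor drop
   the call because the result looks unused. */
static int fib_opaque(int n)
{
    assert(n >= 0 && n <= 46);      /* fib(47) would overflow a 32-bit int */
    volatile int arg = n;
    volatile int result = fib(arg);
    return result;
}
```

In the benchmark loop, f += fib_opaque(40) would then take the place of f += fib(40), while fib itself is still optimized as usual.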
Note that this is still not the end. A future compiler could have advanced so far that it guesses that 40 is a likely argument for fib. Maybe it builds a database of likely values. And for the most likely values, it builds up a small cache. And when fib is called, it does a fast runtime check whether it has that value cached. Of course, the runtime check adds some overhead, but maybe the compiler estimates that this overhead is minor for some particular code in relation to the speed gained by the cache.
I'm not sure if a compiler will ever do such optimization but it could. Profile Guided Optimization (PGO) goes already in that direction.
