I am new to OpenCL and I am struggling to speed up my application. The OpenCL kernel takes much longer than a sequential approach. I am trying to encrypt a 4096 x 4096 image. This is the kernel I've written:
__kernel void image_XOR(
    __constant const unsigned int *inputImage,
    __global unsigned int *outputImage,
    __constant double *serpentineR,
    __constant double *nonce,
    __global unsigned int *signature)
{
    unsigned int i = get_global_id(0);
    double decimalsPwr = pow(10.0, 15.0), serpentine2Pwr = pow(2.0, (*serpentineR));
    unsigned int aux;
    unsigned long long XORseq;
    unsigned int decimals = floor(decimalsPwr * fabs(*nonce));

    XORseq = decimals ^ (unsigned long long) floor((1.0 / (i + 1)) * decimalsPwr);
    if (i % 2 == 1) {
        aux = floor(decimalsPwr * fabs(atan(1.0 / tan(decimalsPwr * (double) XORseq))));
    } else {
        aux = floor(decimalsPwr * fabs(sin(serpentine2Pwr * (double) XORseq) * cos(serpentine2Pwr * (double) XORseq)));
    }
    aux = aux << 8u; // comment out if the alpha channel should be encrypted as well
    aux = aux >> 8u;
    outputImage[i] = inputImage[i] ^ aux;
    *signature = *signature ^ inputImage[i] ^ aux;
}
Note: If I comment out these lines, the code runs much faster (0.5 s instead of 4 s):
if (i % 2 == 1) {
    aux = floor(decimalsPwr * fabs(atan(1.0 / tan(decimalsPwr * (double) XORseq))));
} else {
    aux = floor(decimalsPwr * fabs(sin(serpentine2Pwr * (double) XORseq) * cos(serpentine2Pwr * (double) XORseq)));
}
double decimalsPwr = pow(10.0, 15.0), serpentine2Pwr = pow(2.0, (*serpentineR));
serpentineR is used as a scalar, so pass it as a scalar rather than via global memory, which is much slower. But here I would go even further and not do the above calculations on the GPU at all: pre-calculate them on the CPU and pass the results to the kernel. Remember that every one of these calculations has to be performed 4096 x 4096 times; what a waste of resources!
unsigned int decimals = floor(decimalsPwr * fabs(*nonce)); - same here: pre-calculate it on the CPU and pass it as a parameter to the kernel. It only needs to be calculated once.
Another suggestion would be to avoid using 64-bit types in the kernel as much as possible. In most cases they are much slower than 32-bit types on a GPU. Take the GeForce RTX 2060 as an example: Wikipedia states its single-precision throughput as 5241.60 GFLOPS but its double-precision throughput as just 163.80 GFLOPS. That is a 32x difference! If reducing precision is not an option, it is often worth performing the 64-bit calculations on the CPU and passing the results to the GPU for the remaining calculations.
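Putting the above together, a sketch of the kernel with those values precomputed on the host and passed as scalar arguments (1e15 stands in for pow(10.0, 15.0), and ulong is OpenCL's 64-bit type; argument names are illustrative):

__kernel void image_XOR(
    __constant const unsigned int *inputImage,
    __global unsigned int *outputImage,
    const double serpentine2Pwr,  /* host precomputes pow(2.0, serpentineR) */
    const uint decimals,          /* host precomputes floor(1e15 * fabs(nonce)) */
    __global unsigned int *signature)
{
    unsigned int i = get_global_id(0);
    unsigned int aux;
    ulong XORseq = decimals ^ (ulong) floor((1.0 / (i + 1)) * 1e15);
    if (i % 2 == 1) {
        aux = floor(1e15 * fabs(atan(1.0 / tan(1e15 * (double) XORseq))));
    } else {
        aux = floor(1e15 * fabs(sin(serpentine2Pwr * (double) XORseq) * cos(serpentine2Pwr * (double) XORseq)));
    }
    aux = (aux << 8u) >> 8u;  /* clear the top byte; remove to encrypt alpha too */
    outputImage[i] = inputImage[i] ^ aux;
    *signature = *signature ^ inputImage[i] ^ aux;  /* still racy; see the next answer */
}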
Let me start by saying I'm not familiar with this encryption scheme, so some of my comments may not be useful.
Before we try to spend much time optimising it, are you sure it produces consistent results? Floating-point precision isn't well defined in OpenCL, especially not for trig functions, so if you need to decrypt on another system (e.g. different GPU brand), are you obtaining sufficient precision? For example, are you able to decrypt an image on the CPU after encrypting it on the GPU?
Beyond that, some observations:
As #doqtor has pointed out already, you have a bunch of values which do not vary across work items, so precalculate those and pass them in directly.
decimals = floor(decimalsPwr * fabs(*nonce)); looks like it will overflow, assuming your nonce gets anywhere near 1.0. Is that intended?
With GPUs typically scheduling threads in lock-step, you want adjacent work items following the same conditional path. This is the opposite of what you're doing with if (i % 2 == 1). I suggest rearranging the calculation across work items such that 32 or 64 adjacent work items follow one path and the next group follows the other path.
For example, i = (i & ~0x7f) | ((i & 0x3f) << 1) | ((i >> 6) & 1) should process items 0, 2, 4, 6, … 124, 126, 1, 3, 5, … 125, 127, 128, 130, … in that order. (It will require an integer multiple of 128 input pixels, though, unless you specially handle other cases.) This should prevent all threads performing all possible calculations and throwing away half of them.
You should be able to use some trig identities to simplify calculations. For example, sin(θ)cos(θ) = sin(2θ)/2. This will save you evaluating both cos and sin.
As I already mentioned in the comment on #doqtor's answer, *signature = *signature ^ inputImage[i] ^ aux; is not atomic, so it will not generate a predictable result. Use atomic_xor() instead. (You may want to collect the signature of a work group and only have one work item in the group update the global signature, as atomic global ops carry a performance penalty.)
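For illustration, the group-level idea might look something like this (a rough sketch: contrib stands for each work item's inputImage[i] ^ aux, local_sig is a __local buffer sized to the work-group, and a power-of-two work-group size is assumed):

__kernel void signature_reduce(__global const uint *contrib,
                               __global uint *signature,
                               __local uint *local_sig)
{
    uint lid = get_local_id(0);
    local_sig[lid] = contrib[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* XOR tree reduction in local memory */
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            local_sig[lid] ^= local_sig[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    /* only one global atomic per work-group */
    if (lid == 0)
        atomic_xor(signature, local_sig[0]);
}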
As you are heavily trig bound, and trig functions on many GPUs don't use the same parts of the FPU as multiplication and addition, you may want to experiment with using explicit implementations of some of the trig functions so they can be better parallelised. Depending on the precision you need, you could also try using look-up tables.
Just some observations...
unsigned long long XORseq;
AFAIK, there is no such type in OpenCL. It might work, but it's unportable code. Also, OpenCL types are not C types: unlike in C, "unsigned long" is always 64 bits in OpenCL. IOW, OpenCL "unsigned long" == uint64_t in C; similarly for other types.
Also, trigonometric functions in OpenCL have predefined requirements on range and allowed error of result (in ULP). These requirements are so strict that most GPUs (especially consumer) simply don't have the hardware to calculate this with a single hardware instruction, because (for games) you don't need it 99.999% of time, and it would just take up silicon. GPUs do have hardware sin/cos instructions, but these have much more limited precision and range - just enough for graphics. You can use them from OpenCL with "native_cos" and "native_sin". The "full" sin/cos (which is what your code has) is calculated using some routines - so it's quite slow. Additionally, your code using doubles instead of floats slows it down further (by 8-32x on consumer GPUs).
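For illustration, a sketch of a kernel using the native variants (precision and argument range of native_* are implementation-defined, so verify they are acceptable for your data):

__kernel void fast_trig(__global const float *in, __global float *out)
{
    size_t i = get_global_id(0);
    /* native_sin/native_cos map to the hardware instructions;
       float + native_ trig trades precision for speed */
    out[i] = native_sin(in[i]) * native_cos(in[i]);
}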
I'm not sure why you decided to use double-precision trigonometry for encryption, but I doubt it will ever be very fast on consumer GPUs.
Plus there's another problem: different OpenCL implementations (AMD, Nvidia, Intel, etc.) can return different results for trigonometric functions; OpenCL only specifies the maximum ULP error. So if you, for example, encrypt on Nvidia GPU OpenCL and then decrypt on Intel CPU OpenCL, you won't necessarily get back the original.
Related
I'm looking for a fast way in C to hash 32-bit numbers more or less uniformly between 0 and 254. 255 is reserved for a special purpose.
As an added constraint, I'm looking for a method that would map well to being used with ISA-specific vector intrinsics or to a language like OpenCL or CUDA without introducing control flow divergence between the vector lanes/threads.
Ordinarily, I would just use the following code to hash the number between 0 and 255, as this is just a fast way of doing x mod 256.
inline uint8_t hash(uint32_t x){ return x & 255; }
I could just give in and use the following:
inline uint8_t hash(uint32_t x){ return x % 255; }
However, this solution seems unimaginative and unlikely to be the highest performing solution. I found code at this site (http://homepage.cs.uiowa.edu/~jones/bcd/mod.shtml#exmod15) that appears to provide a reasonable solution for scalar code and have inserted it here for your convenience.
uint32_t mod255( uint32_t a ) {
    a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits */
    a = (a >> 8) + (a & 0xFF);    /* sum base 2**8 digits */
    if (a < 255) return a;
    if (a < (2 * 255)) return a - 255;
    return a - (2 * 255);
}
I see two potential performance issues with this code:
The large number of if statements makes me question how easy it will be for a compiler or human :) to effectively vectorize the code without leading to control flow divergence within a warp/wavefront on a SIMT architecture or vectorized execution on a multicore CPU. If such divergence does occur, it will reduce parallel efficiency, as the divergent paths will have to be run in series.
It looks like it could be troublesome for a branch predictor (not applicable on common GPU architectures) as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Any recommendations on alternatives that I could use are most welcome. Alternatively, let me know if what I am asking for is unreasonable.
The "if statements on GPU kill performance" is a popular misconception which desperately wants to live on, it seems.
The large number of if statements makes me question how easy it will be for a compiler or human :) to vectorize the code.
First of all I wouldn't consider 2 if statements a "large number of if statements", and those are so short and trivial that I'm willing to bet the compiler will turn them into branchless conditional moves or predicated instructions. There will be no performance penalty at all. (Do check the generated assembly, however).
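For illustration, an explicitly branchless formulation of the same digit-sum trick might look like this (a sketch; one extra folding pass keeps the full 32-bit input range correct with a single masked subtract):

#include <stdint.h>

uint8_t mod255_branchless(uint32_t a)
{
    a = (a >> 16) + (a & 0xFFFF); /* sum base 2**16 digits; a < 2**17    */
    a = (a >> 8) + (a & 0xFF);    /* sum base 2**8 digits;  a <= 765     */
    a = (a >> 8) + (a & 0xFF);    /* one more pass;         a <= 257     */
    a -= 255 * (a >= 255);        /* conditional subtract without branch */
    return (uint8_t)a;            /* result is always in [0, 254]        */
}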
It looks like it could be troublesome for a branch predictor as the code path that executes depends on the value of the input. Therefore, if there is a mix of small and large values interspersed with one another, this code will likely sacrifice some performance due to a moderate number of branch mispredictions.
Current GPUs do not have branch predictors. Note however that depending on the underlying hardware, operation on integers (and notably shifting) may be quite costly.
I would just do this:
uchar fast_mod255( uint a32 ) {
    /* keep the intermediates wide: the sums can exceed 16 and 8 bits,
       and wrapping would break the congruence mod 255 */
    uint a16 = (a32 >> 16) + (a32 & 0xFFFF); /* sum base 2**16 digits */
    uint a8  = (a16 >> 8) + (a16 & 0xFF);    /* sum base 2**8 digits */
    return (a8 % 255);
}
Another option is to just do:
uchar fast_mod255( uchar4 a ) {
    /* dot() takes two floating-point vectors, so convert and dot with 1s */
    return ((uint) dot(convert_float4(a), (float4)(1.0f))) % 255;
}
GPUs are very efficient at computing distances and dot products, even in 4 dimensions, and it is a valid way of hashing as well, discarding the overflowed values.
No branching, and a clever compiler can even optimize the modulo out. Or do you really need the values that fall in the 255 zone to follow a scattered pattern instead of all mapping to one value?
I wanted to answer my own question because over the last 2 years I have seen ways to get around a slow integer divide instruction. The easiest way is to make the integer a compile-time constant. Any decent modern compiler should replace the integer divide with an equivalent set of other instructions with typically higher throughput (how many such instructions can be retired per cycle) and reduced latency (how many cycles it takes the instruction to execute). If you're curious, check out Hacker's Delight (an excellent book on low-level computer arithmetic).
I wanted to share another finding, which I found on Daniel Lemire's blog (located here). The code that follows doesn't compute mod 255 but does something similar, which is equally useful in a number of applications and much faster.
Suppose that you have a set of numbers S that are uniformly randomly picked from the range 0 to 2^k - 1 inclusive, where k >= 0. In this case, if you care only about mapping numbers roughly uniformly from 0 to 254 inclusive, you may do the following:
For each number n in a set S, you may map n to one of the 255 candidate values by multiplying n by 255 and then arithmetically shifting the result to the right by k digits.
Here is the function that you call on each n for a fixed value of k:
int map_to_0_to_254(int n, int k){
    return (n * 255) >> k;
}
As an example, if the values for the argument n range uniformly randomly from 0 to 4095 (2^12 - 1), then map_to_0_to_254(n, 12) will return a value in the range 0 to 254 inclusive.
Here is a more general templated version in C++ for mapping to range from 0 to range_size - 1 inclusive:
template<typename T>
T map_to_0_to_range_size_minus_1(T n, T range_size, T k){
    return (n * range_size) >> k;
}
REMEMBER that this code assumes that the inputs for n are roughly uniformly randomly distributed between 0 and 2^k - 1 inclusive. If that property holds, then the outputs will be roughly uniformly distributed between 0 and range_size - 1 inclusive. The larger 2^k is relative to range_size, the more uniform the mapping will be for a fixed set of inputs.
Why This is Useful
This approach has applications to computing hash functions for hash tables where the number of bins is not a power of 2. Those operations would ordinarily require a long-latency integer divide instruction, which is often an order of magnitude slower to execute than an integer multiply, because you often do not know the number of bins in the hash table at compile time.
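For the common case where the inputs span the full 32-bit range (i.e. k = 32), the same multiply-shift can be done with a 64-bit intermediate. A sketch:

#include <stdint.h>

/* maps x roughly uniformly to [0, range_size) with no division */
static inline uint32_t map_range_32(uint32_t x, uint32_t range_size)
{
    return (uint32_t)(((uint64_t)x * range_size) >> 32);
}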
I need to be able to use floating-point arithmetic in my dev environment in C (CPU: ~12 MHz Motorola 68000). The standard library is not present, meaning this is bare-bones C, and no, it isn't gcc, due to several other issues.
I tried getting the SoftFloat library to compile and one other 68k-specific FP library (name of which escapes me at this moment), but their dependencies cannot be resolved for this particular platform - mostly due to libc deficiencies. I spent about 8 hrs trying to overcome the linking issues, until I got to a point where I know I can't get further.
However, it took mere half an hour to come up with and implement the following set of functions that emulate floating-point functionality sufficiently for my needs.
The basic idea is that both fractional and non-fractional part are 16-bit integers, thus there is no bit manipulation.
The nonfractional part has a range of [-32767, 32767] and the fractional part [-0.9999, +0.9999] - which gives us 4 digits of precision (good enough for my floating-point needs - albeit wasteful).
It looks to me like this could be used to make a faster, smaller (just 2 bytes) alternative version of a float with ranges [-99, +99] and [-0.9, +0.9].
The question here is, what other techniques - other than IEEE - are there to make an implementation of basic floating-point functionality (+ - * /) using fixed-point functionality?
Later on, I will need some basic trigonometry, but there are lots of resources on net for that.
Since the HW has 2 MBs of RAM, I don't really care if I can save 2 bytes per soft-float (say - by reserving 9 vs 7 bits in an int). Thus - 4 bytes are good enough.
Also, from brief looking at the 68k instruction manual (and the cycle costs of each instruction), I made few early observations:
bit-shifting is slow, and unless performance is of critical importance (not the case here), I'd prefer easy debugging of my soft-float library versus 5-cycles-faster code. Besides, since this is C and not 68k ASM, it is obvious that speed is not a critical factor.
8-bit operands are as slow as 16-bit (give or take a cycle in most cases), thus it looks like it does not make much sense to compress floats for the sake of performance.
What improvements / approaches would you propose to implement floating-point in C using fixed-point without any dependency on other library/code?
Perhaps it would be possible to use a different approach and do the operations on frac & non-frac parts at the same time?
Here is the code (tested only using the calculator), please ignore the C++ - like declaration and initialization in the middle of functions (I will reformat that to C-style later):
inline int Pad (int f) // Pad the fractional part to 4 digits
{
    if (f < 10) return f * 1000;
    else if (f < 100) return f * 100;
    else if (f < 1000) return f * 10;
    else return f;
}

// We assume fractional parts are padded to full 4 digits
inline void Add (int & b1, int & f1, int b2, int f2)
{
    b1 += b2;
    f1 += f2;
    if (f1 > 9999) { b1++; f1 -= 10000; }
    else if (f1 < -9999) { b1--; f1 += 10000; }
    f1 = Pad (f1);
}

inline void Sub (int & b1, int & f1, int b2, int f2)
{
    // 123.1652 - 18.9752 = 104.1900
    b1 -= b2;                  // 105
    f1 -= f2;                  // -8100
    if (f1 < 0) { b1--; f1 += 10000; }
    f1 = Pad (f1);
}

// ToDo: Implement a multiplication by float
inline void Mul (int & b1, int & f1, int num)
{
    // 123.9876 * 251 = 31120.8876
    b1 *= num;                 // 30873
    long q = f1 * num;         // 2478876
    int add = q / 10000;       // 247
    b1 += add;                 // 31120
    f1 = q - (add * 10000);    // 8876
    f1 = Pad (f1);
}

// ToDo: Implement a division by float
inline void Div (int & b1, int & f1, int num)
{
    // 123.9876 / 25 = 4.959504
    int b2 = b1 / num;              // 4
    long q = b1 - (b2 * num);       // 23
    f1 = ((q * 10000) + f1) / num;  // (23000 + 9876) / 25 = 9595
    b1 = b2;
    f1 = Pad (f1);
}
You are thinking in the wrong base for a simple fixed-point implementation. It is much easier if you use bits for the fractional part, e.g. 16 bits for the integer part and 16 bits for the fractional part (integer range -32768..32767, precision of 1/2^16, which is far finer than what you have).
The best part is that addition and subtraction are simple: just add the two whole words together. Multiplication is a little trickier: you have to be aware of overflow, so it helps to do the multiplication in 64 bits. You also have to shift the result after the multiplication (by however many bits are in your fractional part).
#include <stdint.h>

typedef int32_t fixed16; /* Q16.16: 16 integer bits, 16 fractional bits */

fixed16 mult_f(fixed16 op1, fixed16 op2)
{
    /* widen before multiplying so the 32x32-bit product doesn't overflow;
     * you may need to do something tricky with upper and lower halves if
     * you don't have native 64 bit, but the compiler might do it for us
     * if we are lucky */
    int64_t tmp = (int64_t)op1 * (int64_t)op2;
    /* shift back by the number of fractional bits;
     * add in error handling for overflow if you wish - this just wraps */
    return (fixed16)(tmp >> 16);
}
Division is similar.
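A sketch of the division case, reusing the fixed16 typedef above (the dividend is widened and pre-shifted so the quotient keeps its fractional bits):

fixed16 div_f(fixed16 op1, fixed16 op2)
{
    /* pre-shift the dividend by the 16 fractional bits before dividing;
     * caller must ensure op2 != 0, and the result wraps on overflow */
    return (fixed16)(((int64_t)op1 << 16) / op2);
}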
Someone might have implemented almost exactly what you need (or something that can be hacked to make it work): it's called libfixmath.
If you decide to use fixed point, the whole number (i.e. both the integer and fractional parts) should be in the same base. Using binary for the integer part and decimal for the fractional part, as above, is not optimal and slows down the calculation: with binary fixed point you only need to shift by an appropriate amount after each operation, instead of the long adjustments your idea requires. If you want to use Q16.16 then libfixmath, as dave mentioned above, is a good choice. If you want a different precision or point position, such as Q14.18 or Q19.13, then write your own library or modify an existing one for your own use. Some examples:
BoostGSoC15/fixed_point
https://github.com/johnmcfarlane/cnl
See also What's the best way to do fixed-point math?
If you want a larger range then floating point may be the better choice. Write a library to your own requirements: choose a format that is easy to implement and easy to make fast in software; there is no need to follow the IEEE 754 specification (which is only fast with hardware implementations, due to the odd number of bits and the strange position of the exponent bits) unless you intend to exchange data with other devices. For example, a format of exp.sign.significand with 7 exponent bits, followed by a sign bit and then 24 bits of significand. The exponent doesn't need to be biased, so to extract it only an arithmetic shift by 25 is needed, and the sign is extended along with it. But if shifts are slower than a subtraction then excess-n is better.
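A sketch of that field layout in C (assuming arithmetic right shift on signed integers, which virtually all compilers provide; names are illustrative):

#include <stdint.h>

typedef int32_t sfloat; /* [7-bit exponent][1 sign bit][24-bit significand] */

/* arithmetic shift by 25 yields the unbiased exponent, sign-extended */
static inline int32_t sf_exponent(sfloat f) { return f >> 25; }

/* sign of the number: 0 = positive, 1 = negative */
static inline int sf_sign(sfloat f) { return (f >> 24) & 1; }

/* the 24 significand bits */
static inline uint32_t sf_significand(sfloat f) { return (uint32_t)f & 0xFFFFFF; }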
I'm looking for a way to truncate a float into an int in a fast and portable (IEEE 754) way. The reason is that 50% of this function's time is spent in the cast:
float fm_sinf(float x) {
    const float a =  0.00735246819687011731341356165096815f;
    const float b = -0.16528911397014738207016302002888890f;
    const float c =  0.99969198629596757779830113868360584f;
    float r, x2;
    int k;

    /* bring x in range */
    k = (int)(F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
    x -= k * F_PI;

    /* if x is in an odd pi count we must flip */
    r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */

    x2 = x * x;
    return r * x * (c + x2 * (b + a * x2));
}
The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.
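For example, the range-reduction line could become something like this (a sketch; it assumes the default round-to-nearest mode, and note lrintf rounds ties to even rather than away from zero, which should not matter for range reduction):

#include <math.h>

/* hypothetical drop-in for the range-reduction step: lrintf does the
 * rounding itself, so the copysignf(0.5f, x) bias term goes away */
static inline int reduce_k(float inv_pi, float x)
{
    return (int)lrintf(inv_pi * x);
}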
I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.
To be portable you would have to add some directives and learn a couple of assembler dialects, but you could theoretically use some inline assembly to move portions of the floating-point register into eax/rax and ebx/rbx and convert what you need by hand. The floating-point format is a pain to deal with, but I am pretty certain that doing it in assembly would be much faster, as your needs are very specific and the system method is probably more generic and less efficient for your purpose.
You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and inspect the raw mantissa (use a union) at the appropriate bit position (calculated using the exponent) to determine (the quadrant dependent) r.
I have the following C function, used to determine whether one number is a multiple of another to an arbitrary tolerance:
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x, double mod)
{
    return (fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining in fabs. I'm trying to figure a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up would not be an issue, typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel, AMD on desktop, and ARM or SH4 on mobile devices. VisualStudio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just result = x / mod and then check if the decimal part of result is close to 0. For instance:
11 / 5.4999 = 2.0000364 ==> 0.0000364 < TOLERANCE
Or something like that.
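A sketch of that idea (the name is illustrative; note the tolerance now applies to the quotient rather than the remainder, so it is effectively relative to mod):

#include <math.h>

#define TOLERANCE 0.0001

/* how far does x/mod land from the nearest whole number? */
int IsMultipleOfDiv(double x, double mod)
{
    double q = x / mod;
    return fabs(q - floor(q + 0.5)) < TOLERANCE;
}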
Division (floating point or not; fmod in your case) is an operation whose execution time often varies a lot depending on the CPU and compiler:

gcc has a builtin replacement for it if you give it the right compile flags or if you use __builtin_fmod explicitly. This might then map the operation to a small number of assembler instructions.

there may be special units like SSE on Intel processors where this operation is implemented more efficiently.

By such tricks, depending on your environment (you didn't say which), the time may vary from a few clock cycles to a few hundred. I think it is best to look into the documentation of your compiler and CPU for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
 * If applying the exponent would eliminate the fraction bits
 * then for double precision resolution it is a multiple.
 * Note: lsb may require some massaging.
 */
if (exp > lsb)
    return (true);
if (exp < 0)
    return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.
I think you need to inspect the bowels of your C RTL's fmod() function: x86 FPUs have FPREM/FPREM1 instructions, which compute remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod, the following should be an equivalent inlined version, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (Also, I don't even know for sure whether this is correct.)
#include <math.h>

int IsMultipleOf(double x, double mod) {
    long n = x / mod;       // You should probably test for /0 or NAN result here
    double new_x = mod * n;
    double delta = x - new_x;
    return fabs(delta) < TOLERANCE; // and for NAN result from fabs
}
Maybe you can get away with long long instead of double if you have a comparable scale of data. For example, a long long would be enough for over 60 astronomical units at micrometer resolution.
Does it need to be double precision ? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#include <stdbool.h>

#define TOLERANCE 0.0001f

bool IsMultipleOf(float x, float mod)
{
    return (fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x, m) {
    while (x > m) {
        x = x - m
    }
    return x
}
I think that through some sort of search it could be optimised, e.g.:
fastmod(x, m) {
    q = 1
    while (m * q < x) {
        q = q * 2
    }
    return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m then return x.
I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only it was slow, but it did not vectorize (with Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart for negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
    float *rowA = NULL;
    int *intRowA = NULL;
    int row, col;
    for (row = 0; row < height; ++row) {
        rowA = matA + row * width_aligned;
        intRowA = intA + row * width_aligned;
        #pragma ivdep
        for (col = 0; col < width; ++col) {
            /*intRowA[col] = floor(rowA[col]);*/
            intRowA[col] = (int)(rowA[col]);
        }
    }
}
A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also, it's a lot of code, so inlining the whole floor function would make your program run slower.
Some compilers have special flags that allow the compiler to optimize away some of the rarely used C-standard rules. For example, GCC can be told that you're not interested in errno at all; to do so, pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.
Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.
If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
    int i = (int)x;     /* truncate */
    return i - (i > x); /* convert trunc to floor */
}
Branchless floor and ceiling (to better utilize the pipeline), no error check:
int f(double x)
{
    return (int)x - (x < (int)x); // as dgobbi above, needs less-than for floor
}

int c(double x)
{
    return (int)x + (x > (int)x);
}

or using floor:

int c(double x)
{
    return -(f(-x));
}
The actual fastest implementation for a large array on modern x86 CPUs would be
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
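A sketch of those three steps with SSE intrinsics (function name is illustrative; error handling and loop unrolling are omitted):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

void floor_ps_array(const float *in, int32_t *out, size_t n)
{
    unsigned int old_mode = _MM_GET_ROUNDING_MODE();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);  /* round toward -Infinity */

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128  v = _mm_loadu_ps(in + i);
        __m128i r = _mm_cvtps_epi32(v);     /* honors the current rounding mode */
        _mm_storeu_si128((__m128i *)(out + i), r);
    }
    for (; i < n; i++)                      /* scalar tail, same rounding mode */
        out[i] = _mm_cvtss_si32(_mm_set_ss(in[i]));

    _MM_SET_ROUNDING_MODE(old_mode);        /* restore the caller's mode */
}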
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation exactly because it's very commonly needed by C compilers; in the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. Use cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic), or, for scalar, going from XMM to integer registers, cvttss2si or cvttsd2si for scalar double to scalar integer.)
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.
An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.
#include <limits.h>

long fast_floor(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x + offset) - offset);
}

long fast_ceil(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x - offset) + offset);
}
As pointed out in the comments, this implementation relies on the temporary value x ± offset not overflowing.
On 64-bit platforms, the original code using an int64_t intermediate value results in a three-instruction kernel; the same is available for int32_t reduced-range floor/ceil, where |x| < 0x40000000:

#include <stdint.h>

inline int floor_x64(double x) {
    return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}

inline int floor_x86_reduced_range(double x) {
    return (int)(x + 0x40000000) - 0x40000000;
}
They do not do the same thing. floor() is a function, so using it incurs a function call: allocating a stack frame, copying parameters and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler?
Jon Bentley's Programming Pearls has a great review of possible optimizations.
Fast double round
double round(double x)
{
    // compare the fractional part, not x itself, so that e.g. 1.2 rounds
    // down; valid for the non-negative inputs used in the benchmark below
    return double((x - int(x) >= 0.5) ? (int(x) + 1) : int(x));
}
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
#include <cmath>
#include <ctime>
#include <iostream>
#include <limits>

void test(const char* name, double (*f)(double))
{
    int it = std::numeric_limits<int>::max();
    clock_t begin = clock();
    for (int i = 0; i < it; i++)
    {
        f(double(i) / 1000.0);
    }
    clock_t end = clock();
    std::cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << std::endl;
}
int main(int argc, char **argv)
{
    test("custom_1", round);
    test("native_1", std::round);
    test("custom_2", round);
    test("native_2", std::round);
    test("custom_3", round);
    test("native_3", std::round);
    return 0;
}
Result
Type casting and using your brain is ~2.2 times faster than using the native function.