Fast float to int conversion (truncate)

I'm looking for a fast and portable (IEEE 754) way to truncate a float to an int. The reason is that 50% of the time in this function is spent in the cast:
float fm_sinf(float x) {
const float a = 0.00735246819687011731341356165096815f;
const float b = -0.16528911397014738207016302002888890f;
const float c = 0.99969198629596757779830113868360584f;
float r, x2;
int k;
/* bring x in range */
k = (int) (F_1_PI * x + copysignf(0.5f, x)); /* <-- 50% of time is spent in cast */
x -= k * F_PI;
/* if x is in an odd pi count we must flip */
r = 1 - 2 * (k & 1); /* trick for r = (k % 2) == 0 ? 1 : -1; */
x2 = x * x;
return r * x*(c + x2*(b + a*x2));
}

The slowness of float->int casts mainly occurs when using x87 FPU instructions on x86. To do the truncation, the rounding mode in the FPU control word needs to be changed to round-to-zero and back, which tends to be very slow.
When using SSE instead of x87 instructions, a truncation is available without control word changes. You can do this using compiler options (like -mfpmath=sse -msse -msse2 in GCC) or by compiling the code as 64-bit.
The SSE3 instruction set has the FISTTP instruction to convert to integer with truncation without changing the control word. A compiler may generate this instruction if instructed to assume SSE3.
Alternatively, the C99 lrint() function will convert to integer with the current rounding mode (round-to-nearest unless you changed it). You can use this if you remove the copysignf term. Unfortunately, this function is still not ubiquitous after more than ten years.

I found a fast truncate method by Sree Kotay which provides exactly the optimization that I needed.

To be portable you would have to add some preprocessor directives and learn a couple of assembly languages, but you could theoretically use inline assembly to move portions of the floating-point register into eax/rax and ebx/rbx and do the conversion you need by hand. The floating-point specification is a pain, but I am pretty certain that doing it in assembly will be much faster: your needs are very specific, and the system method is probably more generic and less efficient for your purpose.

You could skip the conversion to int altogether by using frexpf to get the mantissa and exponent, and inspect the raw mantissa (use a union) at the appropriate bit position (calculated using the exponent) to determine (the quadrant dependent) r.

Related

C: Representing a fraction without floating points

I'm writing some code for an embedded system (MSP430) without hardware floating point support. Unfortunately, I will need to work with fractions in my code as I'm doing ranging, and a short-range sensor with a precision of 1m isn't a very good sensor.
I can do as much of the math as I need in ints, but by the end there are two values that I will definitely need to have fractions on: range and speed. Range will be a value between 2 and 500 (cm), while speed should stay between -10 and 10 (ms^-1). I am unsure how to represent them without floating point values, if it is possible. A simple way of rounding the fractions up or down would be best.
Some sample code I have:
voltage_difference_new = ((memval3_new - memval4_new)*3.3/4096);
where memval3_new and memval4_new are ints, but voltage_difference_new is a float.
Please let me know if more information is needed. Or if there is a blindingly easy fix.

You have rather answered your own question with the statement:
Range will be a value between 2-500 (cm),
Work in centimetre (or even millimetre) rather than metre units.
That said you don't need floating-point hardware to do floating point math; the compiler will support "soft" floating point and generate the code to perform floating point operations - it will be slower than hardware floating point or integer operations, but that may not be an issue in your application.
Nonetheless there are many reasons to avoid floating-point even with hardware support and it does not sound like your case for FP is particularly compelling, but it is hard to tell without seeing your code and a specific example. In 32 years of embedded systems development I have seldom resorted to FP even for trig, log, sqrt and digital signal processing.
A general method is to use a fixed point representation. My earlier suggestion of using centimetres is an example of decimal fixed point, but for greater efficiency you should use binary fixed point. For example you might represent distance in 1/1024 metre units (giving better than 1 mm resolution). Because the fixed point is binary, all the necessary rescaling can be done with shifts rather than more expensive multiply/divide operations.
For example, say you have an 8 bit sensor generating linear output 0 to 255 corresponding to a real distance 0 to 0.5 metre.
#define Q10_SHIFT 10 // 10 bits fractional (1/1024)
typedef int q10_t ;
#define ONE_METRE (1 << Q10_SHIFT)
#define SENSOR_MAX 255
#define RANGE_MAX (ONE_METRE/2)
q10_t distance = (q10_t)((long)read_sensor() * RANGE_MAX / SENSOR_MAX) ; // long avoids 16-bit int overflow
distance is in Q10 fixed point representation. Performing addition and subtraction on such is normal integer arithmetic; multiply and divide require scaling:
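q10_t q10_add( q10_t a, q10_t b )
{
return a + b ;
}
q10_t q10_sub( q10_t a, q10_t b )
{
return a - b ;
}
q10_t q10_mul( q10_t a, q10_t b )
{
// widen first: a * b overflows a 16-bit int
return (q10_t)(((long)a * b) >> Q10_SHIFT) ;
}
q10_t q10_div( q10_t a, q10_t b )
{
// pre-scale the dividend in a long so the quotient stays in Q10
return (q10_t)(((long)a * (1L << Q10_SHIFT)) / b) ;
}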
Of course you may want to be able to mix types and say multiply a q10_t by an int - providing a comprehensive library for fixed-point can get complex. Personally for that I use C++ where you have classes, function overloading and operator overloading to support more natural code. But unless your code has a great deal of general fixed point math, it may be simpler to code specific fixed point operations ad-hoc.
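For the rounding the question asks about, one possible approach is to add half a unit before shifting. The sketch below converts a Q10 distance in metres to whole millimetres, rounded to nearest. Q10_SHIFT and q10_t mirror the definitions above; q10_to_mm is a hypothetical helper name, and the code assumes a non-negative distance.

```c
#define Q10_SHIFT 10            /* 10 fractional bits (1/1024) */
typedef int q10_t;

/* Hypothetical helper: convert a non-negative Q10 distance in metres
   to whole millimetres, rounding to nearest. */
long q10_to_mm(q10_t d)
{
    const long half = 1L << (Q10_SHIFT - 1);   /* 0.5 in Q10 */
    return ((long)d * 1000 + half) >> Q10_SHIFT;
}
```

Dropping the half term would round down instead, which may be all some applications need.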
To take the one example you have provided:
double voltage_difference_new = ((memval3_new - memval4_new)*3.3/4096);
The floating-point there is trivially removed using millivolts:
int voltage_difference_new_mv = (int)(((long)(memval3_new - memval4_new) * 3300) / 4096) ; // long: the product overflows a 16-bit int
The issue then perhaps becomes one of presentation. For example if you have to present or report the value in volts to a user. In that case:
int volt_fract = abs(voltage_difference_new_mv % 1000) ;
int volt_whole = voltage_difference_new_mv / 1000 ;
printf( "%d.%03d", volt_whole, volt_fract ) ; // three digits: the fraction is in millivolts
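One detail to watch when splitting into whole and fractional parts is that a value between -1 V and 0 V has a whole part of 0, so its sign is lost if only the whole part carries it. A minimal sketch that prints the sign separately follows; format_millivolts is a hypothetical helper, not part of the answer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: format millivolts as volts with three decimals.
   The sign is printed separately so that values between -1 V and 0 V
   (for example -250 mV) keep their minus sign. */
int format_millivolts(char *buf, size_t size, long mv)
{
    const char *sign = (mv < 0) ? "-" : "";
    long mag = labs(mv);
    return snprintf(buf, size, "%s%ld.%03ld", sign, mag / 1000, mag % 1000);
}
```

On a small target without snprintf, the same split into sign, whole part and three-digit fraction could feed whatever output routine is available.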

Power function giving different answer than math.pow function in C

I was trying to write a program to calculate the value of x^n using a while loop:
#include <stdio.h>
#include <math.h>
int main()
{
float x = 3, power = 1, copyx;
int n = 22, copyn;
copyx = x;
copyn = n;
while (n)
{
if ((n % 2) == 1)
{
power = power * x;
}
n = n / 2;
x *= x;
}
printf("%g^%d = %f\n", copyx, copyn, power);
printf("%g^%d = %f\n", copyx, copyn, pow(copyx, copyn));
return 0;
}
Up until the value of 15 for n, the answer from my created function and the pow function (from math.h) gives the same value; but, when the value of n exceeds 15, then it starts giving different answers.
I cannot understand why there is a difference in the answer. Is it that I have written the function in the wrong way or it is something else?

You are mixing up two different types of floating-point data. The pow function uses the double type but your loop uses the float type (which has less precision).
You can make the results coincide by either using the double type for your x, power and copyx variables, or by calling the powf function (which uses the float type) instead of pow.
The latter adjustment (using powf) gives the following output (clang-cl compiler, Windows 10, 64-bit):
3^22 = 31381059584.000000
3^22 = 31381059584.000000
And, changing the first line of your main to double x = 3, power = 1, copyx; gives the following:
3^22 = 31381059609.000000
3^22 = 31381059609.000000
Note that, with larger and larger values of n, you are increasingly likely to get divergence between the results of your loop and the value calculated using the pow or powf library functions. On my platform, the double version gives the same results, right up to the point where the value overflows the range and becomes Infinity. However, the float version starts to diverge around n = 55:
3^55 = 174449198498104595772866560.000000
3^55 = 174449216944848669482418176.000000
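To see the type effect in isolation, the question's loop can be written once per type. This is only a sketch; power_f and power_d are illustrative names, and the loop is the same exponentiation by squaring as in the question.

```c
/* Exponentiation by squaring in float. */
float power_f(float x, unsigned n)
{
    float p = 1.0f;
    while (n) {
        if (n & 1u)
            p *= x;     /* each product is rounded to float */
        n >>= 1;
        x *= x;
    }
    return p;
}

/* The same loop in double. */
double power_d(double x, unsigned n)
{
    double p = 1.0;
    while (n) {
        if (n & 1u)
            p *= x;     /* each product is rounded to double */
        n >>= 1;
        x *= x;
    }
    return p;
}
```

With x = 3 and n = 22, every intermediate value in the double version is an integer below 2^53, so the result is exact, while the float version has to round as soon as an intermediate exceeds 2^24.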

When I run your code I get this:
3^22 = 31381059584.000000
3^22 = 31381059609.000000
This would be because pow returns a double but your code uses float. When I changed to powf I got identical results:
3^22 = 31381059584.000000
3^22 = 31381059584.000000
So simply use double everywhere if you need high resolution results.

Floating point math is imprecise (and float is worse than double, having even fewer bits to store the data in; using double might delay the imprecision longer). The pow function (usually) uses an exponentiation algorithm that minimizes precision loss, and/or delegates to a chip-level instruction that may do stuff more efficiently, more precisely, or both. There could be more than one implementation of pow too, depending on whether you tell the compiler to use strictly conformant floating point math, the fastest possible, the hardware instruction, etc.
Your code is fine (though using double would get more precise results), but matching the improved precision of math.h's pow is non-trivial; by the time you've done so, you'll have reinvented it. That's why you use the library function.
That said, for logically integer math as you're using here, precision loss from your algorithm likely doesn't matter, it's purely the float vs. double issue where you lose precision from the type itself. As a rule, default to using double, and only switch to float if you're 100% sure you don't need the precision and can't afford the extra memory/computation cost of double.

Precision
float x = 3, power = 1; ... power = power * x forms a float product.
pow(x, y) forms a double result and good implementations internally use even wider math.
OP's loop method incurs rounded results after the 15th iteration. These roundings slowly compound the inaccuracy of the final result.
3^16 is a 26-bit odd number.
float encodes all odd numbers exactly until typically 2^24. Larger values are all even and of only 24 significant binary digits.
double encodes all odd numbers exactly until typically 2^53.
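These limits can be checked with a round trip through each type. The helpers below are hypothetical and assume the usual IEEE 754 float and double formats; they are only meant for values well inside the long long range.

```c
/* Returns 1 if the integer v survives a round trip through float.
   volatile forces the value to be stored as a real float even on
   targets that evaluate expressions in a wider format. */
int fits_in_float(long long v)
{
    volatile float f = (float)v;
    return (long long)f == v;
}

/* Same check for double. */
int fits_in_double(long long v)
{
    volatile double d = (double)v;
    return (long long)d == v;
}
```

For example, 3^15 = 14348907 passes the float check while 3^16 = 43046721 does not, which is where the loop in the question starts to round.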
To do a fair comparison, use:
double objects and pow() or
float objects and powf().
For large powers, the pow(f)() function is likely to provide better answers than a loop, as such functions often use internally extended precision and well-managed rounding, unlike the loop approach.

What are the approaches to SW floating-point implementation?

I need to be able to use floating-point arithmetic under my dev environment in C (CPU: ~12 MHz Motorola 68000). The standard library is not present, meaning it is a bare-bones C and no - it isn't gcc due to several other issues
I tried getting the SoftFloat library to compile and one other 68k-specific FP library (name of which escapes me at this moment), but their dependencies cannot be resolved for this particular platform - mostly due to libc deficiencies. I spent about 8 hrs trying to overcome the linking issues, until I got to a point where I know I can't get further.
However, it took mere half an hour to come up with and implement the following set of functions that emulate floating-point functionality sufficiently for my needs.
The basic idea is that both fractional and non-fractional part are 16-bit integers, thus there is no bit manipulation.
The nonfractional part has a range of [-32767, 32767] and the fractional part [-0.9999, +0.9999] - which gives us 4 digits of precision (good enough for my floating-point needs - albeit wasteful).
It looks to me, like this could be used to make a faster, smaller - just 2 Bytes-big - alternative version of a float with ranges [-99, +99] and [-0.9, +0.9]
The question here is, what other techniques - other than IEEE - are there to make an implementation of basic floating-point functionality (+ - * /) using fixed-point functionality?
Later on, I will need some basic trigonometry, but there are lots of resources on net for that.
Since the HW has 2 MBs of RAM, I don't really care if I can save 2 bytes per soft-float (say - by reserving 9 vs 7 bits in an int). Thus - 4 bytes are good enough.
Also, from brief looking at the 68k instruction manual (and the cycle costs of each instruction), I made few early observations:
bit-shifting is slow, and unless performance is of critical importance (not the case here), I'd prefer easy debugging of my soft-float library versus 5-cycles-faster code. Besides, since this is C and not 68k ASM, it is obvious that speed is not a critical factor.
8-bit operands are as slow as 16-bit (give or take a cycle in most cases), thus it looks like it does not make much sense to compress floats for the sake of performance.
What improvements / approaches would you propose to implement floating-point in C using fixed-point without any dependency on other library/code?
Perhaps it would be possible to use a different approach and do the operations on frac & non-frac parts at the same time?
Here is the code (tested only using the calculator), please ignore the C++ - like declaration and initialization in the middle of functions (I will reformat that to C-style later):
inline int Pad (int f) // Pad the fractional part to 4 digits
{
if (f < 10) return f*1000;
else if (f < 100) return f*100;
else if (f < 1000) return f*10;
else return f;
}
// We assume fractional parts are padded to full 4 digits
inline void Add (int & b1, int & f1, int b2, int f2)
{
b1 += b2;
f1 +=f2;
if (f1 > 9999) { b1++; f1 -=10000; }
else if (f1 < -9999) { b1--; f1 +=10000; }
f1 = Pad (f1);
}
inline void Sub (int & b1, int & f1, int b2, int f2)
{
// 123.1652 - 18.9752 = 104.1900
b1 -= b2; // 105
f1 -= f2; // -8100
if (f1 < 0) { b1--; f1 +=10000; }
f1 = Pad (f1);
}
// ToDo: Implement a multiplication by float
inline void Mul (int & b1, int & f1, int num)
{
// 123.9876 * 251 = 31120.8876
b1 *=num; // 30873
long q = (long)f1*num; //2478876 (widen first: the product does not fit a 16-bit int)
int add = q/10000; // 247
b1+=add; // 31120
f1 = q-(add*10000);//8876
f1 = Pad (f1);
}
// ToDo: Implement a division by float
inline void Div (int & b1, int & f1, int num)
{
// 123.9876 / 25 = 4.959504
int b2 = b1/num; // 4
long q = b1 - (b2*num); // 23
f1 = ((q*10000) + f1) / num; // (230000+9876) / 25 = 9595
b1 = b2;
f1 = Pad (f1);
}

You are thinking in the wrong base for a simple fixed point implementation. It is much easier if you use the bits for the decimal place. e.g. using 16 bits for the integer part and 16 bits for the decimal part (range -32767/32767, precision of 1/2^16 which is a lot higher precision than you have).
The best part is that addition and subtraction are simple (just add the two parts together). Multiplication is a little bit trickier: you have to be aware of overflow and so it helps to do the multiplication in 64 bit. You also have to shift the result after the multiplication (by however many bits are in your decimal).
#include <stdint.h>
typedef int32_t fixed16;
fixed16 mult_f(fixed16 op1, fixed16 op2)
{
/* you may need to do something tricky with upper and lower if you don't
* have native 64 bit but the compiler might do it for us if we are lucky
*/
int64_t tmp;
/* widen before multiplying, otherwise the product overflows 32 bits */
tmp = ((int64_t)op1 * op2) >> 16;
/* add in error handling for overflow if you wish - this just wraps */
return (fixed16)tmp;
}
Division is similar.
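A minimal sketch of what the division might look like, assuming the same Q16.16 layout in a 32-bit integer; div_f is a name chosen here to match mult_f and is not from any library.

```c
#include <stdint.h>

typedef int32_t fixed16;   /* Q16.16 stored in 32 bits */

/* Sketch of a Q16.16 divide: pre-scale the dividend by 2^16 in 64 bits
   so the quotient comes out in Q16.16. The caller must ensure b != 0;
   overflow of the result is not checked. */
fixed16 div_f(fixed16 a, fixed16 b)
{
    /* multiply rather than shift, since a may be negative */
    int64_t scaled = (int64_t)a * 65536;
    return (fixed16)(scaled / b);
}
```

For instance, dividing 3.0 (3 * 65536) by 2.0 (2 * 65536) gives 98304, which is 1.5 in Q16.16.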
Someone may already have implemented almost exactly what you need (or something that can be hacked to make it work): it's called libfixmath.

If you decide to use fixed-point, the whole number (i.e both int and fractional parts) should be in the same base. Using binary for the int part and decimal for the fractional part as above is not very optimal and slows down the calculation. Using binary fixed-point you'll only need to shift an appropriate amount after each operation instead of long adjustments like your idea. If you want to use Q16.16 then libfixmath as dave mentioned above is a good choice. If you want a different precision or floating point position such as Q14.18, Q19.13 then write your own library or modify some library for your own use. Some examples
BoostGSoC15/fixed_point
https://github.com/johnmcfarlane/cnl
See also What's the best way to do fixed-point math?
If you want a larger range then floating point may be the better choice. Write a library to your own requirements, and choose a format that is easy to implement and easy to make fast in software; there is no need to follow the IEEE 754 specification (which is only fast with hardware implementations due to the odd number of bits and the strange position of the exponent bits) unless you intend to exchange data with other devices. For example, a format of exp.sign.significand with 7 exponent bits followed by a sign bit and then 24 bits of significand. The exponent doesn't need to be biased, so to get the exponent only an arithmetic shift by 25 is needed, and the sign bit will also be extended. But in case the shift is slower than a subtraction, excess-n is better.

Faster way of finding multiple of double

I have the following C function, used to determine if one number is a multiple of another to an arbitrary tolerance:
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x,double mod)
{
return(fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining in fabs. I'm trying to figure a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up would not be an issue, typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel, AMD on desktop, and ARM or SH4 on mobile devices. VisualStudio 2008 is the compiler.

Do you really have to use modulo for this?
Wouldn't it be possible to just result = x / mod and then check if the decimal part of result is close to 0. For instance:
11 / 5.4999 = 2.000036 ==> 0.000036 < TOLERANCE
Or something like that.

Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the cpu and compiler:
gcc has a builtin replacement for that if you give it the right compile flags or if you use __builtin_fmod explicitly. This might then map the operation onto a small number of assembler instructions.
There may be special units, like SSE on Intel processors, where this operation is implemented more efficiently.
By such tricks, depending on your environment (you didn't tell which) the time may vary from some clock cycles to some hundred. I think best is to look into the documentation of your compiler and cpu for that particular operation.

The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
* If applying the exponent would eliminate the fraction bits
* then for double precision resolution it is a multiple.
* Note: lsb may require some massaging.
*/
if (exp > lsb)
return (true);
if (exp < 0)
return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.
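The field layout listed above can be read out with a memcpy into a 64-bit integer. This is only a sketch of the extraction step, assuming double is IEEE 754 binary64 (where the bias is indeed 1023); the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/* Unbiased exponent of a normal double (assumes IEEE 754 binary64). */
int double_exponent(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);            /* well-defined type pun */
    return (int)((bits >> 52) & 0x7FF) - 1023; /* 11 exponent bits, bias 1023 */
}

/* The 52 stored fraction bits of a double. */
uint64_t double_fraction(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);
    return bits & ((UINT64_C(1) << 52) - 1);
}
```

Zero, subnormals, infinities and NaNs have special exponent encodings and would need separate handling.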

I think you need to inspect the bowels of your C RTL fmod() function: x86 FPUs have FPREM/FPREM1 instructions, which compute remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for the modulus, so your RTL may not use it.

I have not tested this at all, but from the way I understand fmod this should be equivalent inlined, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (also, I don't even know for sure if this is correct).
#include <math.h>
int IsMultipleOf(double x, double mod) {
long n = x / mod; // You should probably test for /0 or NAN result here
double new_x = mod * n;
double delta = x - new_x;
return fabs(delta) < TOLERANCE; // and for NAN result from fabs
}

Maybe you can get away with long long instead of double if you have comparable scale of data. For example long long would be enough for over 60 astronomical units in micrometer resolution.
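If the data can be scaled that way, the check reduces to an integer remainder. A sketch, assuming x, mod and the tolerance are all expressed in the same fixed unit (micrometres, say); IsMultipleOfFixed is a hypothetical name.

```c
#include <stdlib.h>

/* Hypothetical integer variant: x, mod and tol share one fixed unit
   (for example micrometres). mod must be non-zero. */
int IsMultipleOfFixed(long long x, long long mod, long long tol)
{
    return llabs(x % mod) < tol;
}
```

Like the original fmod-based test, this only accepts values at or slightly past a multiple, not values slightly short of one.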

Does it need to be double precision ? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#include <stdbool.h>
#define TOLERANCE 0.0001f
bool IsMultipleOf(float x, float mod)
{
return(fabsf(fmodf(x, mod)) < TOLERANCE);
}

I presume modulo looks a little like this on the inside:
mod(x,m) {
while (x > m) {
x = x - m
}
return x
}
I think that through some sort of search it could be optimised, e.g.:
fastmod(x,m) {
q = 1
while (m * q < x) {
q = q * 2
}
return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m then x is returned.
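A C rendering of this doubling idea might look like the sketch below. It is a variation on the pseudocode rather than a transcription: it assumes x >= 0 and m > 0, and it uses >= so that exact multiples return 0 as fmod does. It also says nothing about how a real fmod is implemented or whether this would be any faster.

```c
/* Remainder by repeatedly subtracting the largest m * 2^k that fits.
   Assumes x >= 0 and m > 0 (no NaN or infinity handling). */
double fastmod(double x, double m)
{
    while (x >= m) {
        double step = m;
        /* double the step while it still fits into x */
        while (step * 2 <= x)
            step *= 2;
        x -= step;      /* removes a whole multiple of m */
    }
    return x;
}
```

Each pass removes at least half of what is left of x, so the number of subtractions grows with the logarithm of x / m instead of with x / m itself.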

Why is floor() so slow?

I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only was it slow, but it also did not vectorize (with the Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart from negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
float *rowA=NULL;
int *intRowA=NULL;
int row, col;
for(row=0 ; row<height ; ++row){
rowA = matA + row*width_aligned;
intRowA = intA + row*width_aligned;
#pragma ivdep
for(col=0 ; col<width; ++col){
/*intRowA[col] = floor(rowA[col]);*/
intRowA[col] = (int)(rowA[col]);
}
}
}

A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also, it's a lot of code, so inlining the whole floor function would make your program run slower.
Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used c-standard rules. For example GCC can be told that you're not interested in errno at all. To do so pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.
Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.

If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
int i = (int)x; /* truncate */
return i - ( i > x ); /* convert trunc to floor */
}

Branch-less floor and ceiling (better utilizes the pipeline), no error check:
int f(double x)
{
return (int) x - (x < (int) x); // as dgobbi above, needs less than for floor
}
int c(double x)
{
return (int) x + (x > (int) x);
}
or using floor
int c(double x)
{
return -(f(-x));
}

The actual fastest implementation for a large array on modern x86 CPUs would be
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. The packed float->int instruction with truncation is cvttps2dq (note the extra t in the mnemonic). For scalar, going from XMM to integer registers, there is cvttss2si, or cvttsd2si for scalar double to scalar integer.)
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)

Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.

An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.
long fast_floor(double x)
{
const unsigned long offset = ~(ULONG_MAX >> 1);
return (long)((unsigned long)(x + offset) - offset);
}
long fast_ceil(double x) {
const unsigned long offset = ~(ULONG_MAX >> 1);
return (long)((unsigned long)(x - offset) + offset );
}
As pointed in the comments, this implementation relies on the temporary value x +- offset not overflowing.
On 64-bit platforms, the original code using an int64_t intermediate value results in a three-instruction kernel; the same is available for a reduced-range int32_t floor/ceil, where |x| < 0x40000000:
inline int floor_x64(double x) {
return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}
inline int floor_x86_reduced_range(double x) {
return (int)(x + 0x40000000) - 0x40000000;
}

They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler?
Jon Bentley's Programming Pearls has a great review of possible optimizations.

Fast double round
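double round(double x)
{
// valid only for non-negative x that fits in an int
int i = int(x);
// round up when the fractional part is at least 0.5
return double((x - i >= 0.5) ? (i + 1) : i);
}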
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
void test(char* name, double (*f)(double))
{
int it = std::numeric_limits<int>::max();
clock_t begin = clock();
for(int i=0; i<it; i++)
{
f(double(i)/1000.0);
}
clock_t end = clock();
cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << endl;
}
int main(int argc, char **argv)
{
test("custom_1",round);
test("native_1",std::round);
test("custom_2",round);
test("native_2",std::round);
test("custom_3",round);
test("native_3",std::round);
return 0;
}
Result
Type casting and using your brain is roughly 2.2 times faster than using native functions in the log above.
