CUDA __float_as_int in acosf implementation

CUDA's math function implementation of acosf (in cuda/math_function.h) contains the passage:
if (__float_as_int(a) < 0) {
t1 = CUDART_PI_F - t1;
}
where a and t1 are floats and CUDART_PI_F is a float previously set to a numerical value close to the mathematical constant Pi.
I am trying to understand what the conditional (if-clause) is testing for and what would be the C equivalent of it or the function/macro __float_as_int(a). I searched for the implementation of __float_as_int() but without success. It seems that __float_as_int() is a built-in macro or function to NVIDIA NVCC. Looking at the PTX that NVCC produces out of the above passage:
.reg .u32 %r<4>;
.reg .f32 %f<46>;
.reg .pred %p<4>;
// ...
mov.b32 %r1, %f1;
mov.s32 %r2, 0;
setp.lt.s32 %p2, %r1, %r2;
selp.f32 %f44, %f43, %f41, %p2;
it becomes clear that __float_as_int() is not a float-to-int rounding (that would have yielded a cvt.s32.f32). Instead it copies the bits of the float %f1 (mov.b32) into %r1 (notice: %r1 is of type u32, unsigned int!) and then compares %r1 as if it were an s32 (signed int, confusing!) with %r2 (whose value is 0).
To me this looks a little odd. But obviously it is correct.
Can someone explain what's going on, and especially what __float_as_int() is doing in the context of the if-clause testing for a negative value (< 0)? And can someone provide a C equivalent of the if-clause and/or the __float_as_int() macro?

__float_as_int reinterprets a float as an int. An int is < 0 when its most significant bit is set. For a float this means the sign bit is set, but it does not necessarily mean the number is negative (it could, for example, be 'negative zero'). This check can be faster than checking whether the float is < 0.0.
A C function could look like:
int __float_as_int(float in) {
    union fi { int i; float f; } conv;
    conv.f = in;
    return conv.i;
}
In some other version of this header __cuda___signbitf is used instead.
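Putting the pieces together, a plain-C equivalent of the quoted if-clause could look like the sketch below. The helper names (float_as_int, adjust_quadrant) and the written-out CUDART_PI_F value are mine, for illustration; memcpy is used instead of the union to sidestep aliasing questions, and optimizing compilers typically turn it into a plain register move, just like the mov.b32 in the PTX.
#include <string.h>

#define CUDART_PI_F 3.14159274f   /* a float value close to pi, for illustration only */

/* Bit-copy a float into an int, like CUDA's __float_as_int(). */
static int float_as_int(float f)
{
    int i;
    memcpy(&i, &f, sizeof i);
    return i;
}

/* Mirrors the quoted passage: reflect t1 around pi when the sign bit of a is set. */
static float adjust_quadrant(float a, float t1)
{
    if (float_as_int(a) < 0)      /* true iff the sign bit of a is set, i.e. signbit(a) */
        t1 = CUDART_PI_F - t1;
    return t1;
}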

Related

Float inputs for which sinf and sin return different results?

I'm trying to understand something about sin and sinf from math.h.
I understand that their types differ: the former takes and returns doubles, and the latter takes and returns floats.
However, GCC still compiles my code if I call sin with float arguments:
#include <stdio.h>
#include <math.h>
#define PI 3.14159265
int main ()
{
    float x, result;
    x = 135 / 180 * PI;
    result = sin (x);
    printf ("The sin of (x=%f) is %f\n", x, result);
    return 0;
}
By default, all compiles just fine (even with -Wall, -std=c99 and -Wpedantic; I need to work with C99). GCC won't complain about me passing floats to sin. If I enable -Wconversion then GCC tells me:
warning: conversion to ‘float’ from ‘double’ may alter its value [-Wfloat-conversion]
result = sin (x);
^~~
So my question is: is there a float input for which using sin, like above, and (implicitly) casting the result back to float, will result in a value that is different from that obtained using sinf?
This program finds three examples on my machine:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
    int i;
    float f;
    for (i = 0; i < 10000; i++) {
        f = (float)rand() / RAND_MAX;
        float f1 = sinf(f);
        float f2 = sin(f);
        if (f1 != f2) printf("jackpot: %.8f %.8f %.8f\n", f, f1, f2);
    }
}
I got:
jackpot: 0.98704159 0.83439910 0.83439904
jackpot: 0.78605396 0.70757037 0.70757031
jackpot: 0.78636044 0.70778692 0.70778686
This will find all the float input values in the range 0.0 to 2 * M_PI where (float)sin(input) != sinf(input):
#include <stdio.h>
#include <math.h>
#include <float.h>
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
int main(void)
{
    for (float in = 0.0; in < 2 * M_PI; in = nextafterf(in, FLT_MAX)) {
        float sin_result = (float)sin(in);
        float sinf_result = sinf(in);
        if (sin_result != sinf_result) {
            printf("sin(%.*g) = %.*g, sinf(%.*g) = %.*g\n",
                   FLT_DECIMAL_DIG, in, FLT_DECIMAL_DIG, sin_result,
                   FLT_DECIMAL_DIG, in, FLT_DECIMAL_DIG, sinf_result);
        }
    }
    return 0;
}
There are 1020963 such inputs on my amd64 Linux system with glibc 2.32.
float precision is approximately 6 significant decimal figures, while double is good for about 15. (These are approximate because they are binary floating-point values, not decimal.)
So, for example, a double value 1.23456789 will become 1.23456xxx as a float, where xxx is unlikely to be 789 in this case.
Clearly not all (in fact very few) double values are exactly representable by float, so will change value when down-converted.
So for:
double a = 1.23456789 ;
float b = a ;
printf( "double: %.10f\n", a ) ;
printf( "float: %.10f\n", b ) ;
The result in my test was:
double: 1.2345678900
float: 1.2345678806
As you can see the float in fact retained 9 significant figures in this case, but it is by no means guaranteed for all possible values.
In your test you have limited the number of instances of mismatch because of the limited and finite range of rand() and also because f itself is float. Consider:
#include <stdio.h>
#include <math.h>

int main()
{
    unsigned mismatch_count = 0;
    unsigned iterations = 0;
    for (double f = 0; f < 6.28318530718; f += 0.000001)
    {
        float f1 = sinf(f);
        float f2 = sin(f);
        iterations++;
        if (f1 != f2)
        {
            mismatch_count++;
        }
    }
    printf("%f%%\n", (double)mismatch_count / iterations * 100.0);
}
In my test about 55% of comparisons mismatched. Changing f to float, the mismatches reduced to 1.3%.
So in your test, you see few mismatches because of the constraints of your method of generating f and its type. In the general case the issue is much more obvious.
In some cases you might see no mismatches - an implementation may simply implement sinf() using sin() with explicit casts. The compiler warning is for the general case of implicitly casting a double to a float without reference to any operations performed prior to the conversion.
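For illustration only, such a pass-through sinf could be as simple as this sketch (my_sinf is a hypothetical name; this is not necessarily what any particular libm does):
#include <math.h>

/* A trivial sinf built on top of sin: widen, compute, narrow. */
float my_sinf(float x)
{
    return (float)sin((double)x);
}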
However, GCC still compiles my code if I call sin with float arguments:
Yes, this is because the argument is implicitly converted to double (because sin() takes a double), and the result is converted back to float (because sin() returns a double) on entering and returning from the sin() function. See below why it is better to use sinf() in this case, instead of having only one function.
You have included math.h which has prototypes for both function calls:
double sin(double);
float sinf(float);
And so the compiler knows that calling sin() requires a conversion from float to double, so it compiles one in before the call, and also compiles a conversion from double to float for the result returned by sin().
If you had not #included <math.h> and had ignored the compiler warning about calling sin() with no prototype, the compiler would still have converted the float to double first (for unspecified argument types this is how it must proceed) and passed the double to the function, which would then be assumed to return int, provoking serious undefined behaviour.
If you use the sinf() function (with the proper prototype) and pass it a float, then no conversion is compiled in: the float is passed as such, and the returned value is assigned to a float variable, also without conversion. Everything proceeds without conversions, which makes for the fastest code.
If you use the sinf() function with no prototype and pass it a float, that float would be promoted to double and passed as such to sinf(), resulting in undefined behaviour. If sinf() somehow returned properly, an int result (which may or may not have anything to do with the calculation, as per UB) would be converted to float (should that even be possible) and assigned to the result variable.
In the case above, if you are operating on floats, it is better to use sinf(): it takes less time to execute (it needs fewer iterations, since less precision is required of them), and the two conversions (float to double and back from double to float) do not have to be compiled into the binary output.
There are some systems where computations on float are an order of magnitude faster than computations on double. The primary purpose of sinf is to allow trigonometric calculations to be performed efficiently on such systems in cases where the lower precision of float is adequate to satisfy the application's needs. Converting the argument to double, calling sin, and converting the result back to float would always yield a value that either matched that of sinf or was more accurate(*), and on some implementations that is in fact the most efficient way of implementing sinf. On other systems, however, such an approach would be more than an order of magnitude slower than using a purpose-designed function to evaluate the sine of a float.
(*) Note that for arguments outside the range +/- π/2, the most mathematically accurate way of computing sin(x) for an exact specified value of x might not be the most accurate way of computing what the calling code wants to know. If an application computes sinf(angle * (2.0f * 3.14159265f)), when angle is 0.5, having the function return (double)3.1415926535897932385-(float)3.14159265f may be more "mathematically accurate" than having it return sin(angle-(2.0f*3.14159265f)), but the latter would more accurately represent the sine of the angle the code was actually interested in.

How to implement wrapping signed int addition in C

This is a complete rewrite of the question. Hopefully it is clearer now.
I want to implement in C a function that performs addition of signed ints with wrapping in case of overflow.
I want to target mainly the x86-64 architecture, but of course the more portable the implementation is the better. I'm also concerned mostly about producing decent assembly code through gcc, clang, icc, and whatever is used on Windows.
The goal is twofold:
write correct C code that doesn't fall into the undefined behavior blackhole;
write code that gets compiled to decent machine code.
By decent machine code I mean a single leal or a single addl instruction on machines which natively support the operation.
I'm able to satisfy either of the two requirements, but not both.
Attempt 1
The first implementation that comes to mind is
int add_wrap(int x, int y) {
return (unsigned) x + (unsigned) y;
}
This seems to work with gcc, clang and icc. However, as far as I know, the C standard doesn't specify the cast from unsigned int to signed int, leaving freedom to the implementations (see also here).
Otherwise, the new type is signed and the value cannot be represented in it; either the result is implementation-defined or an implementation-defined signal is raised.
I believe most (all?) major compilers do the expected conversion from unsigned to int, meaning that they take the correct representative modulo 2^N, where N is the number of bits, but it's not mandated by the standard so it cannot be relied upon (stupid C standard hits again). Also, while this is the simplest thing to do on two's complement machines, it is impossible on ones' complement machines, because there is one residue class which is not representable: that of 2^(N-1).
Attempt 2
According to the clang docs, one can use __builtin_add_overflow like this
int add_wrap(int x, int y) {
int res;
__builtin_add_overflow(x, y, &res);
return res;
}
and this should do the trick with clang, because the docs clearly say
If possible, the result will be equal to mathematically-correct result and the builtin will return 0. Otherwise, the builtin will return 1 and the result will be equal to the unique value that is equivalent to the mathematically-correct result modulo two raised to the k power, where k is the number of bits in the result type.
The problem is that in the GCC docs they say
These built-in functions promote the first two operands into infinite precision signed type and perform addition on those promoted operands. The result is then cast to the type the third pointer argument points to and stored there.
As far as I know, casting from long int to int is implementation-defined, so I don't see any guarantee that this will result in the wrapping behavior.
As you can see on Godbolt, GCC also generates the expected code, but I wanted to be sure that this is not by chance and is indeed part of the specification of __builtin_add_overflow.
icc also seems to produce something reasonable.
This produces decent assembly, but relies on intrinsics, so it's not really standard compliant C.
Attempt 3
Follow the suggestions of those pedantic guys from SEI CERT C Coding Standard.
In their CERT INT32-C recommendation they explain how to check in advance for potential overflow. Here is what comes out following their advice:
#include <limits.h>
int add_wrap(int x, int y) {
    if ((x > 0) && (y > INT_MAX - x))
        return (x + INT_MIN) + (y + INT_MIN);
    else if ((x < 0) && (y < INT_MIN - x))
        return (x - INT_MIN) + (y - INT_MIN);
    else
        return x + y;
}
The code performs the correct checks and compiles to leal with gcc, but not with clang or icc.
The whole CERT INT32-C recommendation is complete garbage, because it tries to transform C into a "safe" language by forcing the programmers to perform checks that should be part of the definition of the language in the first place. And in doing so it forces also the programmer to write code which the compiler can no longer optimize, so what is the reason to use C anymore?!
Edit
The tension is between portability and the quality of the generated assembly.
For instance, with both gcc and clang the following two functions, which are supposed to do the same thing, get compiled to different assembly.
f is bad in both cases, g is good in both cases (addl+jo or addl+cmovnol). I don't know whether jo is better than cmovnol, but g is consistently better than f.
#include <limits.h>
signed int f(signed int si_a, signed int si_b) {
    if (((si_b > 0) && (si_a > (INT_MAX - si_b))) ||
        ((si_b < 0) && (si_a < (INT_MIN - si_b)))) {
        return 0;
    } else {
        return si_a + si_b;
    }
}

signed int g(signed int si_a, signed int si_b) {
    signed int sum;
    if (__builtin_add_overflow(si_a, si_b, &sum)) {
        return 0;
    } else {
        return sum;
    }
}
A bit like @Andrew's answer, but without the memcpy().
Use a union to remove the need for memcpy(). With C2x, we are sure that int is two's complement.
int add_wrap(int x, int y) {
    union {
        unsigned un;
        int in;
    } u = {.un = (unsigned) x + (unsigned) y};
    return u.in;
}
For those who like 1-liners, use a compound literal.
int add_wrap2(int x, int y) {
return ( union { unsigned un; int in; }) {.un = (unsigned) x + (unsigned) y}.in;
}
I'm not so sure because of the rules for casting from unsigned to signed
You quoted the rules exactly. If you convert from an unsigned value to a signed one, then the result is implementation-defined or a signal is raised. In simple words, what will happen is described by your compiler.
For example, the gcc 9.2.0 compiler has the following in its documentation about implementation-defined behavior of integers:
The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.
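If you would rather not lean even on that documented behaviour, a sketch of a variant that stays within what the standard guarantees is to do the addition in unsigned and map the out-of-range half back down by hand (add_wrap_portable is my name; it assumes UINT_MAX == 2 * INT_MAX + 1, which holds on all mainstream targets):
#include <limits.h>

int add_wrap_portable(int x, int y)
{
    unsigned s = (unsigned)x + (unsigned)y;      /* well-defined, wraps modulo 2^N */
    if (s <= (unsigned)INT_MAX)
        return (int)s;                           /* result representable as-is */
    /* s represents a negative result: subtract 2^N without overflowing int */
    return (int)(s - (unsigned)INT_MIN) + INT_MIN;
}
Compilers may or may not boil this down to a single add, so it trades some codegen quality for strict portability.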
I had to do something similar; however, I was working with known width types from stdint.h and needed to handle wrapping 32-bit signed integer operations. The implementation below works because stdint types are required to be 2's complement. I was trying to emulate the behaviour in Java, so I had some Java code generate a bunch of test cases and have tested on clang, gcc and MSVC.
#include <stdint.h>

inline int32_t add_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t sum = a_widened + b_widened;
    return (int32_t)(sum & INT64_C(0xFFFFFFFF));
}

inline int32_t sub_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t difference = a_widened - b_widened;
    return (int32_t)(difference & INT64_C(0xFFFFFFFF));
}

inline int32_t mul_wrap_i32(int32_t a, int32_t b)
{
    const int64_t a_widened = a;
    const int64_t b_widened = b;
    const int64_t product = a_widened * b_widened;
    return (int32_t)(product & INT64_C(0xFFFFFFFF));
}
It seems ridiculous, but I think that the recommended method is to use memcpy. Apparently all modern compilers optimize the memcpy away and it ends up doing just what you're hoping in the first place -- preserving the bit pattern from the unsigned addition.
int a;
int b;
unsigned u = (unsigned)a + b;
int result;
memcpy(&result, &u, sizeof(result));
On x86 clang with optimization, this is a single instruction if the destination is a register.
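Wrapped up as a complete function, the memcpy approach might look like this sketch (add_wrap_memcpy is just an illustrative name):
#include <string.h>

int add_wrap_memcpy(int x, int y)
{
    unsigned u = (unsigned)x + (unsigned)y;   /* unsigned addition wraps, well-defined */
    int result;
    memcpy(&result, &u, sizeof result);       /* reinterpret the bits as int */
    return result;
}
With optimization enabled, gcc and clang typically reduce this to the same single addl/leal as attempt 1.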

Issue working with uint2 and CUDA

Recently I started working with CUDA and Ethereum, and I found a little code snippet in a function that gives me some errors when I try to port it to a CUDA file.
Here is the code snippet:
void keccak_f1600_round(uint2* a, uint r, uint out_size)
{
#if !__ENDIAN_LITTLE__
    for (uint i = 0; i != 25; ++i)
        a[i] = make_uint2(a[i].y, a[i].x);
#endif

    uint2 b[25];
    uint2 t;

    // Theta
    b[0] = a[0] ^ a[5] ^ a[10] ^ a[15] ^ a[20];

#if !__ENDIAN_LITTLE__
    for (uint i = 0; i != 25; ++i)
        a[i] = make_uint2(a[i].y, a[i].x);
#endif
}
The error I am getting concerns the b[0] line and is:
error: no operator "^=" matches these operands operand types are: uint2 ^= uint2
To be honest, I don't have a lot of experience with uint2 and CUDA, and that is why I am asking what I should do to correct this issue.
The exclusive-or operator works with unsigned long long, but not with uint2 (which for CUDA is a built-in struct containing two unsigned ints).
To make the code work, there are several options. Some that come to my mind:
you can use reinterpret_cast<unsigned long long &> before each uint2 in the line that does the exclusive-or (see How to use reinterpret_cast in C++?)
you can rewrite the code to use unsigned long long types everywhere you use uint2 now. This probably produces the most maintainable code.
you can rewrite the line for the exclusive-or among uint2 types, as a pair of exclusive-or lines using the .x and .y members of the uint2, as each is an unsigned int type.
you can define a union type to allow access to the data that is currently type uint2, as either a uint2 or a unsigned long long.
you can overload the ^ exclusive-or operator to work with uint2 types (a short sketch of this option appears after this list).
you can replace the line that produces the error with asm statements to generate the PTX code to perform the exclusive-or for you. See http://docs.nvidia.com/cuda/inline-ptx-assembly/index.html#using-inline-ptx-assembly-in-cuda
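To illustrate the operator-overloading option, a minimal sketch might be the following (CUDA C++; these operator definitions are mine, not part of the CUDA headers, though many CUDA codebases carry similar helpers):
#include <cuda_runtime.h>   // uint2, make_uint2 (pulled in automatically by nvcc for .cu files)

__host__ __device__ inline uint2 operator^(uint2 a, uint2 b)
{
    return make_uint2(a.x ^ b.x, a.y ^ b.y);    // element-wise xor of the two halves
}

__host__ __device__ inline uint2& operator^=(uint2 &a, uint2 b)
{
    a.x ^= b.x;
    a.y ^= b.y;
    return a;
}
With such overloads in scope, the quoted b[0] = a[0] ^ a[5] ^ ... line compiles unchanged.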
uint2 is simply a struct, you'll need to implement ^ using a[].x and a[].y. I couldn't find where the builtin declarations are but Are there advantages to using the CUDA vector types? has a good description of their use.

matlab and c differ with cos function

I have a program implemented in matlab and the same program in c, and the results differ.
I am a bit puzzled that the cos function does not return exactly the same result.
I use the same computer, Intel Core 2 Duo, and 8 bytes double data type in both cases.
Why does the result differ?
Here is the test:
c:
double a = 2.89308776595231886830;
double b = cos(a);
printf("a = %.50f\n", a);
printf("b = %.50f\n", b);
printf("sizeof(a): %ld\n", sizeof(a));
printf("sizeof(b): %ld\n", sizeof(b));
a = 2.89308776595231886830106304842047393321990966796875
b = -0.96928123535654842068964853751822374761104583740234
sizeof(a): 8
sizeof(b): 8
matlab:
a = 2.89308776595231886830
b = cos(a);
fprintf('a = %.50f\n', a);
fprintf('b = %.50f\n', b);
whos('a')
whos('b')
a = 2.89308776595231886830106304842047393321990966796875
b = -0.96928123535654830966734607500256970524787902832031
Name Size Bytes Class Attributes
a 1x1 8 double
Name Size Bytes Class Attributes
b 1x1 8 double
So, b differs a little (very slightly, but enough to make my debugging task difficult):
b = -0.96928123535654842068964853751822374761104583740234 c
b = -0.96928123535654830966734607500256970524787902832031 matlab
I use the same computer, Intel Core 2 Duo, and 8 bytes double data type.
Why does the result differ?
Does matlab not use the hardware cos function built into the Intel processor?
Is there a simple way to use the same cos function in matlab and c (with exact results), even if a bit slower, so that I can safely compare the results of my matlab and c program?
Update:
thanks a lot for your answers!
So, as you have pointed out, the cos function for matlab and c differ.
That's amazing! I thought they were using the cos function built-in in the Intel microprocessor.
The cos version of matlab is equal (at least for this test) to the one of Java:
you can try from matlab also: b=java.lang.Math.cos(a)
Then, I did a small MEX function to use the C cos version from within matlab, and it works fine; this allows me to debug my program (the same one implemented in matlab and c) and see at what point they differ, which was the purpose of this post.
The only thing is that calling the MEX c cos version from matlab is way too slow.
I am now trying to call the Java cos function from c (as it is the same from matlab), see if that goes faster.
Floating point numbers are stored in binary, not decimal. A double precision float has 52 bits of precision, which translates to roughly 15 significant decimal places. In other words, the first 15 significant decimal digits of a double printed in decimal are enough to uniquely determine which double was printed.
As a dyadic rational, a double has an exact representation in decimal, which takes many more decimal places than 15 to represent (in your case, 52 or 53 places, I believe). However, the standards for printf and similar functions do not require the digits past the 15th to be correct; they could be complete nonsense. I suspect one of the two environments is printing the exact value, and the other is printing a poor approximation, and that in reality both correspond to the exact same binary double value.
Using the script at http://www.mathworks.com/matlabcentral/fileexchange/1777-from-double-to-string
the difference between the two numbers is only in the last bit:
octave:1> bc = -0.96928123535654842068964853751822374761104583740234;
octave:2> bm = -0.96928123535654830966734607500256970524787902832031;
octave:3> num2bin(bc)
ans = -.11111000001000101101000010100110011110111001110001011*2^+0
octave:4> num2bin(bm)
ans = -.11111000001000101101000010100110011110111001110001010*2^+0
One of them must be closer to the "correct" answer, assuming the value given for a is exact.
>> be = vpa('cos(2.89308776595231886830)',50)
be =
-.96928123535654836529707365425580405084360377470583
>> bc = -0.96928123535654842068964853751822374761104583740234;
>> bm = -0.96928123535654830966734607500256970524787902832031;
>> abs(bc-be)
ans =
.5539257488326242e-16
>> abs(bm-be)
ans =
.5562972757925323e-16
So, the C library result is more accurate.
For the purposes of your question, however, you should not expect to get the same answer in matlab and whichever C library you linked with.
The result is the same up to 15 decimal places; I suspect that is sufficient for almost all applications. If you require more, you should probably implement your own version of cosine so that you are in control of the specifics and your code is portable across different C compilers.
They will differ because they undoubtedly use different methods to calculate the approximation to the result, or iterate a different number of times. As cosine is defined by an infinite series of terms, an approximation must be used for its software implementation. The CORDIC algorithm is one common implementation.
Unfortunately, I don't know the specifics of the implementation in either case; indeed the C one will depend on which C standard library implementation you are using.
As others have explained, when you enter that number directly in your source code, not all the fraction digits will be used, as you only get 15-16 decimal digits of precision. In fact, it gets converted to the nearest double value in binary (anything beyond the fixed limit of digits is dropped).
To make things worse, and according to @R, IEEE 754 tolerates error in the last bit when using the cosine function. I actually ran into this when using different compilers.
To illustrate, I tested with the following MEX file, once compiled with the default LCC compiler, and then using VS2010 (I am on WinXP 32-bit).
In one function we directly call the C functions (mexPrintf is simply a macro #defined to printf). In the other, we call mexEvalString to evaluate stuff in the MATLAB engine (equivalent to using the command prompt in MATLAB).
prec.c
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include "mex.h"
void c_test()
{
    double a = 2.89308776595231886830L;
    double b = cos(a);
    mexPrintf("[C] a = %.25Lf (%16Lx)\n", a, a);
    mexPrintf("[C] b = %.25Lf (%16Lx)\n", b, b);
}

void matlab_test()
{
    mexEvalString("a = 2.89308776595231886830;");
    mexEvalString("b = cos(a);");
    mexEvalString("fprintf('[M] a = %.25f (%bx)\\n', a, a)");
    mexEvalString("fprintf('[M] b = %.25f (%bx)\\n', b, b)");
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    matlab_test();
    c_test();
}
compiled with LCC
>> prec
[M] a = 2.8930877659523189000000000 (4007250b32d9c886)
[M] b = -0.9692812353565483100000000 (bfef045a14cf738a)
[C] a = 2.8930877659523189000000000 ( 32d9c886)
[C] b = -0.9692812353565484200000000 ( 14cf738b) <---
compiled with VS2010
>> prec
[M] a = 2.8930877659523189000000000 (4007250b32d9c886)
[M] b = -0.9692812353565483100000000 (bfef045a14cf738a)
[C] a = 2.8930877659523189000000000 ( 32d9c886)
[C] b = -0.9692812353565483100000000 ( 14cf738a) <---
I compile the above using: mex -v -largeArrayDims prec.c, and switch between the backend compilers using: mex -setup
Note that I also tried to print the hexadecimal representation of the numbers. I only managed to show the lower half of binary double numbers in C (perhaps you can get the other half using some sort of bit manipulations, but I'm not sure how!)
Finally, if you need more precision in your calculations, consider using a library for variable precision arithmetic. In MATLAB, if you have access to the Symbolic Math Toolbox, try:
>> a = sym('2.89308776595231886830');
>> b = cos(a);
>> vpa(b,25)
ans =
-0.9692812353565483652970737
So you can see that the actual value is somewhere between the two different approximations I got above, and in fact they are all equal up to the 15th decimal place:
-0.96928123535654831.. # 0xbfef045a14cf738a
-0.96928123535654836.. # <--- actual value (cannot be represented in 64-bit)
-0.96928123535654842.. # 0xbfef045a14cf738b
^
15th digit --/
UPDATE:
If you want to correctly display the hexadecimal representation of floating point numbers in C, use this helper function instead (similar to NUM2HEX function in MATLAB):
/* you need to adjust for double/float datatypes, big/little endianness */
void num2hex(double x)
{
    unsigned char *p = (unsigned char *) &x;
    int i;
    for (i = sizeof(double) - 1; i >= 0; i--) {
        printf("%02x", p[i]);
    }
}
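A quick usage sketch (assuming the num2hex() helper above is in the same file together with <stdio.h> and <math.h>):
int main(void)
{
    double b = cos(2.89308776595231886830);
    num2hex(b);      /* prints the full 16-hex-digit pattern, e.g. bfef045a14cf738a */
    printf("\n");
    return 0;
}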

Why is floor() so slow?

I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only was it slow, but it did not vectorize (with the Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart from negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
    float *rowA = NULL;
    int *intRowA = NULL;
    int row, col;
    for (row = 0; row < height; ++row) {
        rowA = matA + row * width_aligned;
        intRowA = intA + row * width_aligned;
#pragma ivdep
        for (col = 0; col < width; ++col) {
            /*intRowA[col] = floor(rowA[col]);*/
            intRowA[col] = (int)(rowA[col]);
        }
    }
}
A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also, it's a lot of code, so inlining the whole floor function would make your program run slower.
Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used C-standard rules. For example, GCC can be told that you're not interested in errno at all. To do so, pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.
Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.
If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
    int i = (int)x;         /* truncate */
    return i - ( i > x );   /* convert trunc to floor */
}
Branchless floor and ceiling (to better utilize the pipeline), with no error check:
int f(double x)
{
    return (int) x - (x < (int) x); // as dgobbi above, needs less-than for floor
}

int c(double x)
{
    return (int) x + (x > (int) x);
}
or, using floor:
int c(double x)
{
    return -(f(-x));
}
The actual fastest implementation for a large array on modern x86 CPUs would be
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back. It's cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic); for scalar, going from XMM to integer registers, it's cvttss2si for scalar float or cvttsd2si for scalar double.)
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
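A minimal sketch of the approach described above, using SSE2 intrinsics (the function name is mine; it assumes the element count is a multiple of 4 and that nothing else in the loop depends on the default rounding mode):
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_getcsr/_mm_setcsr, _MM_SET_ROUNDING_MODE */
#include <emmintrin.h>   /* SSE2: _mm_loadu_ps, _mm_cvtps_epi32, _mm_storeu_si128 */

/* Convert n floats to ints, rounding toward -Infinity (i.e. floor).
   n is assumed here to be a multiple of 4. */
void floor_ps_to_epi32(const float *src, int *dst, size_t n)
{
    unsigned int old_csr = _mm_getcsr();
    _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);       /* MXCSR: round toward -Infinity */

    for (size_t i = 0; i < n; i += 4) {
        __m128  v = _mm_loadu_ps(src + i);
        __m128i r = _mm_cvtps_epi32(v);          /* honours the current rounding mode */
        _mm_storeu_si128((__m128i *)(dst + i), r);
    }

    _mm_setcsr(old_csr);                         /* restore the caller's rounding mode */
}
This keeps the conversion itself at one cvtps2dq per vector, as described above.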
Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.
An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.
#include <limits.h>   /* ULONG_MAX */

long fast_floor(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x + offset) - offset);
}

long fast_ceil(double x)
{
    const unsigned long offset = ~(ULONG_MAX >> 1);
    return (long)((unsigned long)(x - offset) + offset);
}
As pointed in the comments, this implementation relies on the temporary value x +- offset not overflowing.
On 64-bit platforms, the original code using an int64_t intermediate value compiles to a three-instruction kernel; the same is available for a reduced-range int32_t floor/ceil, where |x| < 0x40000000:
#include <stdint.h>   /* int64_t */

inline int floor_x64(double x) {
    return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}

inline int floor_x86_reduced_range(double x) {
    return (int)(x + 0x40000000) - 0x40000000;
}
They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler?
Jon Bentley's Programming Pearls has a great review of possible optimizations.
Fast double round
double round(double x)
{
    // rounds non-negative x to the nearest integer (halves round up)
    return double((x - int(x) >= 0.5) ? (int(x) + 1) : int(x));
}
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
#include <cmath>
#include <ctime>
#include <iostream>
#include <limits>

void test(const char* name, double (*f)(double))
{
    int it = std::numeric_limits<int>::max();
    clock_t begin = clock();
    for (int i = 0; i < it; i++)
    {
        f(double(i) / 1000.0);
    }
    clock_t end = clock();
    std::cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << std::endl;
}

int main(int argc, char **argv)
{
    test("custom_1", round);
    test("native_1", std::round);
    test("custom_2", round);
    test("native_2", std::round);
    test("custom_3", round);
    test("native_3", std::round);
    return 0;
}
Result
Type casting and using your brain is roughly twice as fast as using the native functions (about 8.4 versus 18.5 seconds in the log above).
