Just found the following line in some old src code:
int e = (int)fmod(matrix[i], n);
where matrix is an array of int, and n is a size_t
I'm wondering why the use of fmod rather than % where we have integer arguments, i.e. why not:
int e = (matrix[i]) % n;
Could there possibly be a performance reason for choosing fmod over % or is it just a strange bit of code?
fmod might be a bit faster on architectures with a high-latency IDIV instruction that takes (say) ~50 cycles or more, so that the cost of fmod's function call and the int <-> double conversions can be amortized.
According to Agner Fog's instruction tables, IDIV on the AMD K10 architecture takes 24-55 cycles. On a modern Intel Haswell, the latency range is listed as 22-29 cycles; however, if there are no dependency chains, the reciprocal throughput is much better on Intel, at 8-11 clock cycles.
fmod might be a tiny bit faster than the integer division on selected architectures.
Note however that if n has a known non zero value at compile time, matrix[i] % n would be compiled as a multiplication with a small adjustment, which should be much faster than both the integer modulus and the floating point modulus.
Another interesting difference is the behavior for n == 0 and for INT_MIN % -1. Integer modulus has undefined behavior in both cases (division by zero, and overflow of the corresponding quotient), which results in abnormal program termination on many current architectures. Conversely, the floating point modulus does not trap in these corner cases: fmod(x, 0) yields NaN and fmod(INT_MIN, -1) yields zero. NaN cannot be represented as an int, so the conversion back to int is not well defined by the C standard, but it does not usually cause abnormal program termination. This might be the reason for the original programmer to have chosen this surprising solution.
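If avoiding those corner cases was the goal, an explicit check keeps the code in integer arithmetic. The following is only a sketch: checked_mod is a hypothetical helper name, not something from the original code, and it assumes the caller wants a failure flag rather than a trap.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: computes a % n without undefined behavior.
   Returns false (and leaves *out untouched) when n == 0. */
static bool checked_mod(int a, int n, int *out)
{
    if (n == 0)
        return false;          /* a % 0 is undefined */
    if (n == -1) {
        *out = 0;              /* avoids INT_MIN % -1; the remainder is always 0 */
        return true;
    }
    *out = a % n;              /* result takes the sign of a (C99 and later) */
    return true;
}
```

The two extra comparisons are well predicted in a hot loop and are cheap next to the division itself.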
Experimentally (and quite counter-intuitively), fmod is faster than % - at least on an AMD Phenom(tm) II X4 955 with 6400 bogomips. Here are two programs that use either of the techniques, both compiled with the same compiler (GCC) and the same options (cc -O3 foo.c -lm), and run on the same hardware:
#include <math.h>
#include <stdio.h>
int main()
{
int volatile a=10,b=12;
int i, sum = 0;
for (i = 0; i < 1000000000; i++)
sum += a % b;
printf("%d\n", sum);
return 0;
}
Running time: 9.07 sec.
#include <math.h>
#include <stdio.h>
int main()
{
int volatile a=10,b=12;
int i, sum = 0;
for (i = 0; i < 1000000000; i++)
sum += (int)fmod(a, b);
printf("%d\n", sum);
return 0;
}
Running time: 8.04 sec.
I am working on a programme to calculate 2^p using the traditional school arithmetic technique, multiply by 2, carry 1 if > 9.
I have written two lines of code that return n / 10 and n % 10 using bitwise operators. These are accurate for 0 - 19 inclusive, which is all that is needed when the multiplier is 2: the largest number is (2 * 9) + 1. However this method is inaccurate from 20 onwards and not needed. These techniques speed up the programme.
Because I would like to be certain I am using C correctly, is this technique good practice?
#include <stdio.h>
int main(void) {
unsigned int d, m;
/* division by 10 */
for (int i=0; i<20; ++i) {
d = (i + 6) >> 4;
printf("%u ", d);
}
printf("\n");
/* modulus 10 */
for (int i=0; i<20; ++i) {
m = i - (((i + 6) >> 4) * 10);
printf("%u ", m);
}
printf("\n");
return 0;
}
This is not good practice.
If you compile code that divides or modulos by constants, and enable optimizations, the compiler will do this for you when it's possible and equivalent (on checking, gcc uses a somewhat more complicated set of imul, shifts and a subtraction rather than an integer division instruction; more expensive than what you wrote, but still much cheaper than integer division). And your code won't be an unreadable mess, that breaks when passed values even slightly outside its design parameters.
There are extremely rare cases where you might do something like this, if:
Profiling has shown that the code in question is the bottleneck preventing your code from achieving adequate performance, and tweaking optimization levels is inadequate to fix it, and
You comment the hell out of what you're doing, including the restrictions on inputs and the rationale for doing this
But outside of that case, you never want to write unreadable, unmaintainable, brittle code just to shave a few cycles.
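For comparison, here is a sketch of the doubling step written with plain / and %, leaving the strength reduction to the compiler. The function name double_decimal and the least-significant-digit-first layout are assumptions for illustration; the caller must provide room for one extra digit.

```c
#include <stddef.h>

/* Doubles a decimal number stored one digit per element, least
   significant digit first. Returns the new digit count.
   The buffer must have capacity for at least len + 1 digits. */
static size_t double_decimal(unsigned char *digits, size_t len)
{
    unsigned carry = 0;
    for (size_t i = 0; i < len; ++i) {
        unsigned v = digits[i] * 2u + carry;   /* at most 2 * 9 + 1 = 19 */
        digits[i] = (unsigned char)(v % 10);   /* compiler picks the cheap form */
        carry = v / 10;
    }
    if (carry)
        digits[len++] = (unsigned char)carry;
    return len;
}
```

Calling it p times on the single digit 1 yields the digits of 2^p, and the code stays correct if the multiplier ever changes.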
When I run the following code:
#include <stdio.h>
int main()
{
int i = 0;
volatile long double sum = 0;
for (i = 1; i < 50; ++i) /* first snippet */
{
sum += (long double)1 / i;
}
printf("%.20Lf\n", sum);
sum = 0;
for (i = 49; i > 0; --i) /* second snippet */
{
sum += (long double)1 / i;
}
printf("%.20Lf", sum);
return 0;
}
The output is:
4.47920533832942346919
4.47920533832942524555
Shouldn't the two numbers be same?
And more interestingly, the following code:
#include <stdio.h>
int main()
{
int i = 0;
volatile long double sum = 0;
for (i = 1; i < 100; ++i) /* first snippet */
{
sum += (long double)1 / i;
}
printf("%.20Lf\n", sum);
sum = 0;
for (i = 99; i > 0; --i) /* second snippet */
{
sum += (long double)1 / i;
}
printf("%.20Lf", sum);
return 0;
}
produces:
5.17737751763962084084
5.17737751763962084084
So why are they different then and same now?
First, please correct your code. According to the C standard, %lf is not the right conversion for long double in the *printf family (the 'l' has no effect there; the argument is still treated as double). To print a long double, one should use %Lf. With your variant, %lf, you can run into bugs from the mismatched format, a truncated value, etc. (You seem to be running in a 32-bit environment: in 64 bits, both Unix and Windows pass double in XMM registers, but long double elsewhere: on the stack for Unix, in memory by pointer for Windows. On Windows/x86_64, your code will segfault because the callee expects a pointer. But with Visual Studio, long double is AFAIK aliased to double, so you can remain ignorant of this difference.)
Second, you can't be sure this code is not optimized by your C compiler to compile-time calculations (which can be done with more precision than default run-time one). To avoid such optimization, mark sum as volatile.
With these changes, your code shows:
At Linux/amd64, gcc4.8:
for 50:
4.47920533832942505776
4.47920533832942505820
for 100:
5.17737751763962026144
5.17737751763962025971
At FreeBSD/i386, gcc4.8, without precision setting or with explicit fpsetprec(FP_PD):
4.47920533832942346919
4.47920533832942524555
5.17737751763962084084
5.17737751763962084084
(the same as in your example);
but, the same test on FreeBSD with fpsetprec(FP_PE), which switches FPU to real long double operations:
4.47920533832942505776
4.47920533832942505820
5.17737751763962026144
5.17737751763962025971
identical to Linux case; so, in real long double, there is some real difference with 100 summands, and it is, in accordance with common sense, larger than for 50. But your platform defaults to rounding to double.
And, finally, in general, this is a well-known effect of finite precision and the consequent rounding. For example, in this classical book, this misrounding when summing a decreasing series is explained in the very first chapters.
I am not really ready now to investigate the source of the results with 50 summands and rounding to double: why it shows such a huge difference and why this difference is compensated with 100 summands. That needs a much deeper investigation than I can afford now, but I hope this answer clearly shows you the next place to dig.
UPDATE: if it's Windows, you can manipulate FPU mode with _controlfp() and _controlfp_s(). In Linux, _FPU_SETCW does the same. This description elaborates some details and gives example code.
UPDATE2: using Kahan summation gives stable results in all cases. The following shows 4 values: ascending i, no KS; ascending i, KS; descending i, no KS; descending i, KS:
50 and FPU to double:
4.47920533832942346919 4.47920533832942524555
4.47920533832942524555 4.47920533832942524555
100 and FPU to double:
5.17737751763962084084 5.17737751763961995266
5.17737751763962084084 5.17737751763961995266
50 and FPU to long double:
4.47920533832942505776 4.47920533832942524555
4.47920533832942505820 4.47920533832942524555
100 and FPU to long double:
5.17737751763962026144 5.17737751763961995266
5.17737751763962025971 5.17737751763961995266
You can see that the difference has disappeared and the results are stable. I would assume this is nearly the final point that can be added here :)
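The code used for the Kahan runs is not shown above, so the following is only a minimal sketch of compensated summation for the same series. The function name harmonic_kahan and its ascending flag are assumptions for illustration. It must be compiled without -ffast-math, which would let the compiler cancel the compensation term algebraically.

```c
/* Sums 1/i for i = 1..n with Kahan (compensated) summation.
   ascending != 0 walks i upward, otherwise downward. */
static long double harmonic_kahan(int n, int ascending)
{
    long double sum = 0.0L;
    long double comp = 0.0L;                 /* error carried from earlier additions */
    for (int k = 1; k <= n; ++k) {
        int i = ascending ? k : n - k + 1;
        long double term = 1.0L / i - comp;  /* apply the correction to this term */
        long double t = sum + term;          /* low-order bits of term may be lost */
        comp = (t - sum) - term;             /* measures exactly what was lost */
        sum = t;
    }
    return sum;
}
```

Printing harmonic_kahan(49, 1) and harmonic_kahan(49, 0) with %.20Lf should show the two directions agreeing far more closely than the plain loops do.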
When I run the following code:
/*Program to find the greatest common divisor of two nonnegative integer
values*/
#include <stdio.h>
int main(void){
printf(" n | n^2\n");
printf("-----------------\n");
for(int n = 1; n<11; n++){
int nSquared = n^2;
printf("%i %i\n",n,nSquared);
}
}
The table that gets returned to the terminal displays as follows
n | n^2
-----------------
1 3
2 0
3 1
4 6
5 7
6 4
7 5
8 10
9 11
10 8
why does the "n^2" side generate the wrong numbers? And is there a way to write superscripts and subscripts in C, so I do not have to display "n^2" and can display that side of the column as "n²" instead?
Use pow function from math.h.
^ is the bitwise exclusive OR operator and has nothing to do with a power function.
The ^ is the XOR operation. You'd either want to use the math.h function "pow", or write your own.
^ is the bitwise xor operator. You should use the pow function declared in the math.h header.
#include <stdio.h>
#include <math.h>
int main(void) {
printf(" n | n^2\n");
printf("-----------------\n");
for(int n = 1; n < 11; n++){
int nSquared = pow(n, 2); // calculate n raised to 2
printf("%i %i\n", n, nSquared);
}
return 0;
}
Link the math library with the -lm flag when compiling with gcc.
As others have pointed out, the problem is that ^ is the bitwise xor operator. C has no exponentiation operator.
You're being advised to use the pow() function to compute the square of an int value.
That's likely to work (if you're careful), but it's not the best approach. The pow function takes two double arguments and returns a double result. (There are powf and powl functions that operate on float and long double, respectively.) That means that pow has to be able to handle arbitrary floating-point exponents; for example, pow(2.0, 1.0/3.0) will give you an approximation of the cube root of two.
Like many floating-point operations, pow is subject to the possibility of rounding errors. It's possible that pow(3.0, 2.0) will yield a result that's just slightly less than 9.0; converting that to int will give you 8 rather than 9. And even if you manage to avoid that problem, converting from integer to floating-point, performing an expensive operation, and then converting back to integer is massive overkill. (The implementation might optimize calls to pow with integer exponents, but I wouldn't count on that.)
It's been said (with slight exaggeration) that premature optimization is the root of all evil, and the time spent doing the extra computations is not likely to be noticeable. But in this case there's a way to do what you want that's both simpler and more efficient. Rather than
int nSquared = n^2;
which is incorrect, or
int nSquared = pow(n, 2);
which is inefficient and possibly unreliable, just write:
int nSquared = n * n;
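For exponents other than 2, the same integer-only idea can be kept with a small helper. This is a sketch, assuming a non-negative exponent and a result that fits in int; ipow is a hypothetical name, not a standard library function.

```c
/* Hypothetical helper: integer power by repeated squaring.
   No overflow checking; the caller must ensure the result fits in int. */
static int ipow(int base, unsigned exp)
{
    int result = 1;
    while (exp > 0) {
        if (exp & 1u)
            result *= base;   /* this bit of the exponent is set */
        exp >>= 1;
        if (exp > 0)
            base *= base;     /* square only while bits remain */
    }
    return result;
}
```

With this, ipow(n, 2) gives the same value as n * n without any floating-point conversion.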
I have the following C function, used to determine if one number is a multiple of another to an arbitrary tolerance
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x,double mod)
{
return(fabs(fmod(x, mod)) < TOLERANCE);
}
It works fine, but profiling shows it to be very slow, to the extent that it has become a candidate for optimization. About 75% of the time is spent in modulo and the remaining in fabs. I'm trying to figure a way of speeding things up, using something like a look-up table. The parameter x changes regularly, whereas mod changes infrequently. The number of possible values of x is small enough that the space for a look-up would not be an issue, typically it will be one of a few hundred possible values. I can get rid of the fabs easily enough, but can't figure out a reasonable alternative to the modulo. Any ideas on how to optimize the above?
Edit The code will be running on a wide range of Windows desktop and mobile devices, hence processors could include Intel, AMD on desktop, and ARM or SH4 on mobile devices. VisualStudio 2008 is the compiler.
Do you really have to use modulo for this?
Wouldn't it be possible to just compute result = x / mod and then check if the decimal part of result is close to 0? For instance:
11 / 5.4999 = 2.000036 ==> 0.000036 < TOLERANCE
Or something like that.
Division (floating point or not, fmod in your case) is often an operation where the execution time varies a lot depending on the cpu and compiler:
gcc has a builtin replacement for that if you give it the right compile flags or if you use __builtin_fmod explicitly. This then might map the operation to a small number of assembler instructions.
there may be special units like SSE on Intel processors where this operation is implemented more efficiently
By such tricks, depending on your environment (you didn't tell which) the time may vary from some clock cycles to some hundred. I think best is to look into the documentation of your compiler and cpu for that particular operation.
The following is probably overkill, and sub-optimal. But for what it is worth here is one way on how to do it.
We know the format of the double ...
1 bit for the sign
11 bits for the biased exponent
52 fraction bits
Let ...
value = x / mod;
exp = exponent bits of value - BIAS;
lsb = least sig bit of value's fraction bits;
Once you have that ...
/*
* If applying the exponent would eliminate the fraction bits
* then for double precision resolution it is a multiple.
* Note: lsb may require some massaging.
*/
if (exp > lsb)
return (true);
if (exp < 0)
return (false);
The only case remaining is the tolerance case. Build your double so that you are getting rid of all the digits to the left of the decimal.
sign bit is zero (positive)
exponent is the BIAS (1023 I think ... look it up to be sure)
shift the fraction bits as appropriate
Now compare it against your tolerance.
I think you need to inspect the bowels of your C RTL fmod() function: x86 FPUs have 'FPREM/FPREM1' instructions which compute remainders by repeated subtraction.
While floating point division is a single instruction, it seems you may need to call FPREM repeatedly to get the right answer for modulus, so your RTL may not use it.
I have not tested this at all, but from the way I understand fmod this should be equivalent inlined, which might let the compiler optimize it better, though I would have thought that the compiler's math library (or builtins) would work just as well. (also, I don't even know for sure if this is correct).
#include <math.h>
#define TOLERANCE 0.0001
int IsMultipleOf(double x, double mod) {
long n = x / mod; // You should probably test for /0 or NAN result here
double new_x = mod * n;
double delta = x - new_x;
return fabs(delta) < TOLERANCE; // and for NAN result from fabs
}
Maybe you can get away with long long instead of double if you have comparable scale of data. For example long long would be enough for over 60 astronomical units in micrometer resolution.
Does it need to be double precision ? Depending on how good your math library is, this ought to be faster:
#include <math.h>
#define TOLERANCE 0.0001f
bool IsMultipleOf(float x, float mod)
{
return(fabsf(fmodf(x, mod)) < TOLERANCE);
}
I presume modulo looks a little like this on the inside:
mod(x,m) {
while (x >= m) {
x = x - m
}
return x
}
I think that through some sort of search it could be optimised, e.g.:
fastmod(x,m) {
q = 1
while (m * q < x) {
q = q * 2
}
return mod((x - (q / 2) * m), m)
}
You might even choose to replace the final call to mod with another call to fastmod, adding the condition that if x < m then it returns x.
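The doubling idea above can be written as runnable C. This is a sketch under stated assumptions: x >= 0 and m > 0, and the name mod_by_doubling is hypothetical. It is not a claim about how any real fmod is implemented.

```c
/* Remainder of x / m for x >= 0 and m > 0.
   Instead of subtracting m one at a time, each pass subtracts the
   largest m * 2^k that still fits into x. */
static double mod_by_doubling(double x, double m)
{
    while (x >= m) {
        double step = m;
        while (step * 2 <= x)   /* grow the step by doubling */
            step *= 2;
        x -= step;              /* step <= x < 2 * step holds here */
    }
    return x;
}
```

Each outer pass at least halves x, so the loop count grows with the logarithm of x / m rather than with x / m itself.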
I wrote some code recently (ISO/ANSI C), and was surprised at the poor performance it achieved. Long story short, it turned out that the culprit was the floor() function. Not only was it slow, but it also did not vectorize (with the Intel compiler, aka ICL).
Here are some benchmarks for performing floor for all cells in a 2D matrix:
VC: 0.10
ICL: 0.20
Compare that to a simple cast:
VC: 0.04
ICL: 0.04
How can floor() be that much slower than a simple cast?! It does essentially the same thing (apart for negative numbers).
2nd question: Does someone know of a super-fast floor() implementation?
PS: Here is the loop that I was benchmarking:
void Floor(float *matA, int *intA, const int height, const int width, const int width_aligned)
{
float *rowA=NULL;
int *intRowA=NULL;
int row, col;
for(row=0 ; row<height ; ++row){
rowA = matA + row*width_aligned;
intRowA = intA + row*width_aligned;
#pragma ivdep
for(col=0 ; col<width; ++col){
/*intRowA[col] = floor(rowA[col]);*/
intRowA[col] = (int)(rowA[col]);
}
}
}
A couple of things make floor slower than a cast and prevent vectorization.
The most important one:
floor can modify the global state. If you pass a value that is too huge to be represented as an integer in float format, the errno variable gets set to EDOM. Special handling for NaNs is done as well. All this behavior is for applications that want to detect the overflow case and handle the situation somehow (don't ask me how).
Detecting these problematic conditions is not simple and makes up more than 90% of the execution time of floor. The actual rounding is cheap and could be inlined/vectorized. Also, it's a lot of code, so inlining the whole floor function would make your program run slower.
Some compilers have special compiler flags that allow the compiler to optimize away some of the rarely used c-standard rules. For example GCC can be told that you're not interested in errno at all. To do so pass -fno-math-errno or -ffast-math. ICC and VC may have similar compiler flags.
Btw - You can roll your own floor-function using simple casts. You just have to handle the negative and positive cases differently. That may be a lot faster if you don't need the special handling of overflows and NaNs.
If you are going to convert the result of the floor() operation to an int, and if you aren't worried about overflow, then the following code is much faster than (int)floor(x):
inline int int_floor(double x)
{
int i = (int)x; /* truncate */
return i - ( i > x ); /* convert trunc to floor */
}
Branch-less floor and ceiling (better utilizes the pipeline), no error check
int f(double x)
{
return (int) x - (x < (int) x); // as dgobbi above, needs less than for floor
}
int c(double x)
{
return (int) x + (x > (int) x);
}
or using floor
int c(double x)
{
return -(f(-x));
}
The actual fastest implementation for a large array on modern x86 CPUs would be
change the MXCSR FP rounding mode to round towards -Infinity (aka floor). In C, this should be possible with fenv stuff, or _mm_getcsr / _mm_setcsr.
loop over the array doing _mm_cvtps_epi32 on SIMD vectors, converting 4 floats to 32-bit integer using the current rounding mode. (And storing the result vectors to the destination.)
cvtps2dq xmm0, [rdi] is a single micro-fused uop on any Intel or AMD CPU since K10 or Core 2. (https://agner.org/optimize/) Same for the 256-bit AVX version, with YMM vectors.
restore the current rounding mode to the normal IEEE default mode, using the original value of the MXCSR. (round-to-nearest, with even as a tiebreak)
This allows loading + converting + storing 1 SIMD vector of results per clock cycle, just as fast as with truncation. (SSE2 has a special FP->int conversion instruction for truncation, exactly because it's very commonly needed by C compilers: cvttps2dq for packed float->int with truncation (note the extra t in the mnemonic), or, for scalar conversions going from XMM to integer registers, cvttss2si, or cvttsd2si for scalar double to scalar integer. In the bad old days with x87, even (int)x required changing the x87 rounding mode to truncation and then back.)
With some loop unrolling and/or good optimization, this should be possible without bottlenecking on the front-end, just 1-per-clock store throughput assuming no cache-miss bottlenecks. (And on Intel before Skylake, also bottlenecked on 1-per-clock packed-conversion throughput.) i.e. 16, 32, or 64 bytes per cycle, using SSE2, AVX, or AVX512.
Without changing the current rounding mode, you need SSE4.1 roundps to round a float to the nearest integer float using your choice of rounding modes. Or you could use one of the tricks shown in other answers that work for floats with small enough magnitude to fit in a signed 32-bit integer, since that's your ultimate destination format anyway.
(With the right compiler options, like -fno-math-errno, and the right -march or -msse4 options, compilers can inline floor using roundps, or the scalar and/or double-precision equivalent, e.g. roundsd xmm1, xmm0, 1, but this costs 2 uops and has 1 per 2 clock throughput on Haswell for scalar or vectors. Actually, gcc8.2 will inline roundsd for floor even without any fast-math options, as you can see on the Godbolt compiler explorer. But that's with -march=haswell. It's unfortunately not baseline for x86-64, so you need to enable it if your machine supports it.)
Yes, floor() is extremely slow on all platforms since it has to implement a lot of behaviour from the IEEE fp spec. You can't really use it in inner loops.
I sometimes use a macro to approximate floor():
#define PSEUDO_FLOOR( V ) ((V) >= 0 ? (int)(V) : (int)((V) - 1))
It does not behave exactly as floor(): for example, floor(-1) == -1 but PSEUDO_FLOOR(-1) == -2, but it's close enough for most uses.
An actually branchless version that requires a single conversion between floating point and integer domains would shift the value x to all positive or all negative range, then cast/truncate and shift it back.
#include <limits.h> /* ULONG_MAX */
long fast_floor(double x)
{
const unsigned long offset = ~(ULONG_MAX >> 1);
return (long)((unsigned long)(x + offset) - offset);
}
long fast_ceil(double x) {
const unsigned long offset = ~(ULONG_MAX >> 1);
return (long)((unsigned long)(x - offset) + offset );
}
As pointed out in the comments, this implementation relies on the temporary value x +- offset not overflowing.
On 64-bit platforms, the original code using an int64_t intermediate value results in a three-instruction kernel; the same is available for an int32_t reduced-range floor/ceil, where |x| < 0x40000000:
#include <stdint.h> /* int64_t */
inline int floor_x64(double x) {
return (int)((int64_t)(x + 0x80000000UL) - 0x80000000LL);
}
inline int floor_x86_reduced_range(double x) {
return (int)(x + 0x40000000) - 0x40000000;
}
They do not do the same thing. floor() is a function. Therefore, using it incurs a function call, allocating a stack frame, copying of parameters and retrieving the result.
Casting is not a function call, so it uses faster mechanisms (I believe that it may use registers to process the values).
Probably floor() is already optimized.
Can you squeeze more performance out of your algorithm? Maybe switching rows and columns may help? Can you cache common values? Are all your compiler's optimizations on? Can you switch an operating system? a compiler?
Jon Bentley's Programming Pearls has a great review of possible optimizations.
Fast double round
double round(double x)
{
// Round half away from zero via truncation; only valid while x fits in int.
return double(int(x + (x >= 0 ? 0.5 : -0.5)));
}
Terminal log
test custom_1 8.3837
test native_1 18.4989
test custom_2 8.36333
test native_2 18.5001
test custom_3 8.37316
test native_3 18.5012
Test
void test(char* name, double (*f)(double))
{
int it = std::numeric_limits<int>::max();
clock_t begin = clock();
for(int i=0; i<it; i++)
{
f(double(i)/1000.0);
}
clock_t end = clock();
cout << "test " << name << " " << double(end - begin) / CLOCKS_PER_SEC << endl;
}
int main(int argc, char **argv)
{
test("custom_1",round);
test("native_1",std::round);
test("custom_2",round);
test("native_2",std::round);
test("custom_3",round);
test("native_3",std::round);
return 0;
}
Result
Type casting and using your brain is roughly 2.2 times faster than using native functions (about 8.4 s versus 18.5 s in the log above).