Why the cos function in math.h faster than x86 fcos instruction

Why the cos function in math.h faster than x86 fcos instruction - c

The cos() in math.h run faster than the x86 asm fcos.
The following code is compare between the x86 fcos and the cos() in math.h.
In this code, 1000000 times asm fcos cost 150ms; 1000000 times cos() call cost only 80ms.
How is the fcos implemented in x86?
Why is the fcos much slower than cos()?
My enviroment is intel i7-6820HQ + win10 + visual studio 2017.
#include "string"
#include "iostream"
#include<time.h>
#include "math.h"
int main()
{
int i;
const int i_max = 1000000;
float c = 10000;
float *d = &c;
float start_value = 8.333333f;
float* pstart_value = &start_value;
clock_t a, b;
a = clock();
__asm {
mov edx, pstart_value;
fld [edx];
}
for (i = 0; i < i_max; i++) {
__asm {
fcos;
}
}
b = clock();
printf("asm time = %u", b - a);
a = clock();
double y;
for (i = 0; i < i_max; i++) {
start_value = cos(start_value);
}
b = clock();
printf("math time = %u", b - a);
return 0;
}
According to my personal understanding, a single asm instruction is usually faster than a function call.
Why in this case the fcos so slow?
Update:
I have run the same code on another laptop with i7-6700HQ.
On this laptop the 1000000 times fcos cost only 51ms. Why there is such a big difference between the two cpus.

I bet the answer is easy. You do not use the result of cos and it is optimized out as in this example
https://godbolt.org/z/iw-nft
Change the variables to volatile to force cos call.
https://godbolt.org/z/9_dpMs
Another guess:
Maybe your cos implementation uses lookup tables. Then it will be faster than the hardware implementation.

Related

About the combination of OpenMP and -Ofast

I implemented OpenMP parallelization in a for loop where I have a sum that is the principal cause of slowing down my code. When I did so, I found out that the final results were not the same that I obtained for the non-parallelize code (which is written in C). So first, one might think "well, I just didn't implemented well the parallelization" but the curious thing is that when I run the parallelized code with the -Ofast optimization suddenly the results are correct.
That would be:
-O0 correct
-Ofast correct
OMP -O0 wrong
OMP -O1 wrong
OMP -O2 wrong
OMP -O3 wrong
OMP -Ofast correct!
What could -Ofast be doing that solves an error that only appears when I implement openmp?
Any recommendation of what could I check or test?
Thanks!
EDIT
Here I include the smallest version of my code that still reproduces the problem.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#define LENGTH 100
#define R 50.0
#define URD 1.0/sqrt(2.0)
#define PI (4.0*atan(1.0)) //pi
const gsl_rng_type * Type;
gsl_rng * item;
double CalcDeltaEnergy(double **M,int sx,int sy){
double DEnergy,r,zz;
int k,j;
double rrx,rry;
int rx,ry;
double Energy, Cpm, Cmm, Cmp, Cpp;
DEnergy = 0;
//OpenMP parallelization:
#pragma omp parallel for reduction (+:DEnergy)
for (int index = 0; index < LENGTH*LENGTH; index++){
k = index % LENGTH;
j = index / LENGTH;
zz = 0.5*(1.0 - pow(-1.0, k + j + sx + sy));
for (rx = -1; rx <= 1; rx++){
for (ry = -1; ry <= 1; ry++){
rrx = (sx - k - rx*LENGTH)*URD;
rry = (sy - j - ry*LENGTH)*URD;
r = sqrt(rrx*rrx + rry*rry + zz);
if(r != 0 && r <= R){
Cpm = sqrt((rrx+0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy])))*(rrx+0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy]))) + (rry+0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy])))*(rry+0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy]))) + zz);
Cmm = sqrt((rrx-0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy])))*(rrx-0.5*(0.702*cos(M[k][j])-0.702*cos(M[sx][sy]))) + (rry-0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy])))*(rry-0.5*(0.702*sin(M[k][j])-0.702*sin(M[sx][sy]))) + zz);
Cpp = sqrt((rrx+0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy])))*(rrx+0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy]))) + (rry+0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy])))*(rry+0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy]))) + zz);
Cmp = sqrt((rrx-0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy])))*(rrx-0.5*(0.702*cos(M[k][j])+0.702*cos(M[sx][sy]))) + (rry-0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy])))*(rry-0.5*(0.702*sin(M[k][j])+0.702*sin(M[sx][sy]))) + zz);
Cpm = 1.0/Cpm;
Cmm = 1.0/Cmm;
Cpp = 1.0/Cpp;
Cmp = 1.0/Cmp;
Energy = (Cpm + Cmm - Cpp - Cmp)/(0.702*0.702); // S=cte=1
DEnergy -= 2.0*Energy;
}
}
}
}
return DEnergy;
}
void Initialize(double **M){
double random;
for(int i=0;i<(LENGTH-1);i=i+2){
for(int j=0;j<(LENGTH-1);j=j+2) {
random=gsl_rng_uniform(item);
if (random<0.5) M[i][j]=PI/4.0;
else M[i][j]=5.0*PI/4.0;
random=gsl_rng_uniform(item);
if (random<0.5) M[i][j+1]=3.0*PI/4.0;
else M[i][j+1]=7.0*PI/4.0;
random=gsl_rng_uniform(item);
if (random<0.5) M[i+1][j]=3.0*PI/4.0;
else M[i+1][j]=7.0*PI/4.0;
random=gsl_rng_uniform(item);
if (random<0.5) M[i+1][j+1]=PI/4.0;
else M[i+1][j+1]=5.0*PI/4.0;
}
}
}
int main(){
//Choose and initiaze the random number generator
gsl_rng_env_setup();
Type = gsl_rng_default; //default=mt19937, ran2, lxs0
item = gsl_rng_alloc (Type);
double **S; //site matrix
S = (double **) malloc(LENGTH*sizeof(double *));
for (int i = 0; i < LENGTH; i++)
S[i] = (double *) malloc(LENGTH*sizeof(double ));
//Initialization
Initialize(S);
int l,m;
for (int cl = 0; cl < LENGTH*LENGTH; cl++) {
l = gsl_rng_uniform_int(item, LENGTH); // RNG[0, LENGTH-1]
m = gsl_rng_uniform_int(item, LENGTH); // RNG[0, LENGTH-1]
printf("%lf\n", CalcDeltaEnergy(S, l, m));
}
//Free memory
for (int i = 0; i < LENGTH; i++)
free(S[i]);
free(S);
return 0;
}
I compile with:
g++ [optimization] -lm test.c -o test.x -lgsl -lgslcblas -fopenmp
and run with:
GSL_RNG_SEED=123; ./test.x > test.dat
Comparing the outputs for different optimizations one can see what I stated before.

Disclaimer: I have little to no experience with OpenMP
It's probably a race condition you run into when using OpenMP.
You'll need to declare all those variables inside the OpenMP loop to be private. One core may calculate their values for a certain value of index, which gets promptly recalculated to different values on a core that uses another value of index: the variables such as k, j, rrx, rry etc are shared between the compute nodes.
Instead of using a pragma like
#pragma omp parallel for private(k,j,zz,rx,ry,rrx,rry,r,Cpm,Cmm,Cpp,Cmp,Energy) reduction (+:D\
(credits to comment by Zulan below:) you can also declare the variables inside the parallel region, as locally as possible. This makes them private implicitly and is less prone to initialization issues and easier to reason about.
(You could even consider putting everything inside the outer for-loop (over index) in a function: the function call overhead is minimal compared to the calculations.)
As to why -Ofast together with OpenMP does actually produce correct output.
My guess is: mostly luck. Here's what -Ofast does (gcc manual):
Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math [...]
Here's the section on -ffast-math:
This option is not turned on by any -O option besides -Ofast since it can result in incorrect output for programs that depend on an exact implementation of IEEE or ISO rules/specifications for math functions. It may, however, yield faster code for programs that do not require the guarantees of these specifications.
Thus, the sqrt, cos and sin will likely be a lot speedier. My guess is, that in this case, the calculations of the variables inside the outer loop don't bite each other, since the individual threads are so fast, they don't conflict. But that is a very handwavingly explanation and guess.

Parallelization for Monte Carlo pi approximation

I am writing a c script to parallelize pi approximation with OpenMp. I think my code works fine with a convincing output. I am running it with 4 threads now. What I am not sure is that if this code is vulnerable to race condition? and if it is, how do I coordinate the thread action in this code ?
the code looks as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
double sample_interval(double a, double b) {
double x = ((double) rand())/((double) RAND_MAX);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int i;
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
#pragma omp parallel firstprivate(x, y, z, i) reduction(+:counter) num_threads(NumThreads)
{
srand(time(NULL));
for (int i=0; i < N; ++i)
{
x = sample_interval(-1.,1.);
y = sample_interval(-1.,1.);
z = ((x*x)+(y*y));
if (z<= 1)
{
counter++;
}
}
}
double approx_pi = 4.0 * counter/ (double)N;
printf("%i %1.6e %1.6e\n ", N, 4.0 * counter/ (double)N, fabs(4.0 * counter/ (double)N - pi) / pi);
return 0;
}
Also I was wondering if the seed for random number should be declared inside or outside parallelization. my output looks like this:
10 3.600000e+00 1.459156e-01
100 3.160000e+00 5.859240e-03
1000 3.108000e+00 1.069287e-02
10000 3.142400e+00 2.569863e-04
100000 3.144120e+00 8.044793e-04
1000000 3.142628e+00 3.295610e-04
10000000 3.141379e+00 6.794439e-05
100000000 3.141467e+00 3.994585e-05
1000000000 3.141686e+00 2.971945e-05
Which looks OK for now. your suggestion for race condition and seed placement is most welcome.

There are a few problems in your code that I can see. The main one is from my standpoint that it isn't parallelized. Or more precisely, you didn't enable the parallelism you introduced with OpenMP while compiling it. Here is the way one can see that:
The way the code is parallelized, the main for loop should be executed in full by all the threads (there is no worksharing here, no #pragma omp parallel for, only a #pragma omp parallel). Therefore, considering you set the number of threads to be 4, the global number of iterations should be 4*N. Thus, your output should slowly converge towards 4*Pi, not towards Pi.
Indeed, I tried your code on my laptop, compiled it with OpenMP support, and that is pretty-much what I get. However, when I don't enable OpenMP, I get an output similar to yours. So in conclusion, you need to:
Enable OpenMP at compilation time for getting a parallel version of your code.
Divide your result by NumThreads to get a "valid" approximation of Pi (or distribute your loop over N with a #pragma omp for for example)
But that is if / when your code is correct elsewhere, which it isn't yet.
As BitTickler already hinted, rand() isn't thread-safe. So you have to go for another random number generator, which will allow you to privatize it's state. That could be rand_r() for example. That said, this still has quite a few issues:
rand() / rand_r() is a terrible RNG in term of randomness and periodicity. While increasing your number of tries, you'll rapidly go over the period of the RNG and repeat over and over again the same sequence. You need something more robust to do anything remotely serious.
Even with a "good" RNG, the parallelism aspect can be an issue in the sense that you want your sequences in parallel to be uncorrelated between each-other. And just using a different seed value per thread doesn't guaranty that to you (although with a wide-enough RNG, you have a bit of headroom for that)
Anyway, bottom line is:
Use a better thread-safe RNG (I find drand48_r() or random_r() to be OK for toy codes on Linux)
Initialize its state per-thread based on the thread id for example, while keeping in mind that this won't ensure a proper decorrelation of the random series in some circumstances (and the larger the number of times you call the functions, the more likely you are to finally have overlapping series).
This done (along with a few minor fixes), your code becomes for example as follows:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
typedef struct drand48_data RNGstate;
double sample_interval(double a, double b, RNGstate *state) {
double x;
drand48_r(state, &x);
return (b-a)*x + a;
}
int main (int argc, char **argv) {
int N = atoi( argv[1] ); // convert command-line input to N = number of points
int NumThreads = 4;
const double pi = 3.141592653589793;
double x, y, z;
double counter = 0;
time_t ctime = time(NULL);
#pragma omp parallel private(x, y, z) reduction(+:counter) num_threads(NumThreads)
{
RNGstate state;
srand48_r(ctime+omp_get_thread_num(), &state);
for (int i=0; i < N; ++i) {
x = sample_interval(-1, 1, &state);
y = sample_interval(-1, 1, &state);
z = ((x*x)+(y*y));
if (z<= 1) {
counter++;
}
}
}
double approx_pi = 4.0 * counter / (NumThreads * N);
printf("%i %1.6e %1.6e\n ", N, approx_pi, fabs(approx_pi - pi) / pi);
return 0;
}
Which I compile like this:
gcc -std=gnu99 -fopenmp -O3 -Wall pi.c -o pi_omp

OpenMP vs gcc compiler optimizations

I'm learning openmp using the example of computing the value of pi via quadature. In serial, I run the following C code:
double serial() {
double step;
double x,pi,sum = 0.0;
step = 1.0 / (double) num_steps;
for (int i = 0; i < num_steps; i++) {
x = (i + 0.5) * step; // forward quadature
sum += 4.0 / (1.0 + x*x);
}
pi = step * sum;
return pi;
}
I'm comparing this to an omp implementation using a parallel for with reduction:
double SPMD_for_reduction() {
double step;
double pi,sum = 0.0;
step = 1.0 / (double) num_steps;
#pragma omp parallel for reduction (+:sum)
for (int i = 0; i < num_steps; i++) {
double x = (i + 0.5) * step;
sum += 4.0 / (1.0 + x*x);
}
pi = step * sum;
return pi;
}
For num_steps = 1,000,000,000, and 6 threads in the case of omp, I compile and time:
double start_time = omp_get_wtime();
serial();
double end_time = omp_get_wtime();
start_time = omp_get_wtime();
SPMD_for_reduction();
end_time = omp_get_wtime();
Using no cc compiler optimizations, the runtimes are around 4s (Serial) and .66s (omp). With the -O3 flag, serial runtime drops to ".000001s" and the omp runtime is mostly unchanged. What's going on here? Is it vector instructions being used, or is it poor code or timing method? If it's vectorization, why isn't the omp function benefiting?
It may be of interest that the machine I am using is using a modern 6 core Xeon processor.
Thanks!

The compiler outsmarts you. For the serial version it is able to detect, that the result of your computation is never used. Therefore it throws out the computation completely.
double start_time = omp_get_wtime();
serial(); //<-- Computations not used.
double end_time = omp_get_wtime();
In the openMP case the compiler can not see if really everything inside the function body is without an effect, so to stay on the safe side it keeps the function call.
You can of course write something like double serial_pi = serial(); and outside of the time measurement do some dummy stuff with the variable serial_pi. This way the compiler will keep the function call and do the optimizations you are actually looking for.

Measuring processor ticks in C

I wanted to calculate the difference in execution time when executing the same code inside a function. To my surprise, however, sometimes the clock difference is 0 when I use clock()/clock_t for the start and stop timer. Does this mean that clock()/clock_t does not actually return the number of clicks the processor spent on the task?
After a bit of searching, it seemed to me that clock_gettime() would return more fine grained results. And indeed it does, but I instead end up with an abitrary number of nano(?)seconds. It gives a hint of the difference in execution time, but it's hardly accurate as to exactly how many clicks difference it amounts to. What would I have to do to find this out?
#include <math.h>
#include <stdio.h>
#include <time.h>
#define M_PI_DOUBLE (M_PI * 2)
void rotatetest(const float *x, const float *c, float *result) {
float rotationfraction = *x / *c;
*result = M_PI_DOUBLE * rotationfraction;
}
int main() {
int i;
long test_total = 0;
int test_count = 1000000;
struct timespec test_time_begin;
struct timespec test_time_end;
float r = 50.f;
float c = 2 * M_PI * r;
float x = 3.f;
float result_inline = 0.f;
float result_function = 0.f;
for (i = 0; i < test_count; i++) {
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
float rotationfraction = x / c;
result_inline = M_PI_DOUBLE * rotationfraction;
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
test_total += test_time_end.tv_nsec - test_time_begin.tv_nsec;
}
printf("Inline clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count,result_inline);
for (i = 0; i < test_count; i++) {
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_begin);
rotatetest(&x, &c, &result_function);
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &test_time_end);
test_total += test_time_end.tv_nsec - test_time_begin.tv_nsec;
}
printf("Function clocks %li, avg %f (result is %f)\n", test_total, test_total / (float)test_count, result_inline);
return 0;
}
I am using gcc version 4.8.4 on Linux 3.13.0-37-generic (Linux Mint 16)

First of all: As already mentioned in the comments, clocking a single run of execution one by the other will probably do you no good. If all goes down the hill, the call for getting the time might actually take longer than the actual execution of the operation.
Please clock multiple runs of the operation (including a warm up phase so everything is swapped in) and calculate the average running times.
clock() isn't guaranteed to be monotonic. It also isn't the number of processor clicks (whatever you define this to be) the program has run. The best way to describe the result from clock() is probably "a best effort estimation of the time any one of the CPUs has spent on calculation for the current process". For benchmarking purposes clock() is thus mostly useless.
As per specification:
The clock() function returns the implementation's best approximation to the processor time used by the process since the beginning of an implementation-dependent time related only to the process invocation.
And additionally
To determine the time in seconds, the value returned by clock() should be divided by the value of the macro CLOCKS_PER_SEC.
So, if you call clock() more often than the resolution, you are out of luck.
For profiling/benchmarking, you should --if possible-- use one of the performance clocks that are available on modern hardware. The prime candidates are probably
The HPET
The TSC
Edit: The question now references CLOCK_PROCESS_CPUTIME_ID, which is Linux' way of exposing the TSC.
If any (or both) are available depends on the hardware in is also operating system specific.

After googling a little bit I can see that clock() function can be used as a standard mechanism to find the tome taken for execution , but be aware that the time will be varying at different time depending upon the load of your processor,
You can just use the below code for calculation
clock_t begin, end;
double time_spent;
begin = clock();
/* here, do your time-consuming job */
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;

Keep float number in range [duplicate]

Is there a more efficient way to clamp real numbers than using if statements or ternary operators?
I want to do this both for doubles and for a 32-bit fixpoint implementation (16.16). I'm not asking for code that can handle both cases; they will be handled in separate functions.
Obviously, I can do something like:
double clampedA;
double a = calculate();
clampedA = a > MY_MAX ? MY_MAX : a;
clampedA = a < MY_MIN ? MY_MIN : a;
or
double a = calculate();
double clampedA = a;
if(clampedA > MY_MAX)
clampedA = MY_MAX;
else if(clampedA < MY_MIN)
clampedA = MY_MIN;
The fixpoint version would use functions/macros for comparisons.
This is done in a performance-critical part of the code, so I'm looking for an as efficient way to do it as possible (which I suspect would involve bit-manipulation)
EDIT: It has to be standard/portable C, platform-specific functionality is not of any interest here. Also, MY_MIN and MY_MAX are the same type as the value I want clamped (doubles in the examples above).

Both GCC and clang generate beautiful assembly for the following simple, straightforward, portable code:
double clamp(double d, double min, double max) {
const double t = d < min ? min : d;
return t > max ? max : t;
}
> gcc -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
GCC-generated assembly:
maxsd %xmm0, %xmm1 # d, min
movapd %xmm2, %xmm0 # max, max
minsd %xmm1, %xmm0 # min, max
ret
> clang -O3 -march=native -Wall -Wextra -Wc++-compat -S -fverbose-asm clamp_ternary_operator.c
Clang-generated assembly:
maxsd %xmm0, %xmm1
minsd %xmm1, %xmm2
movaps %xmm2, %xmm0
ret
Three instructions (not counting the ret), no branches. Excellent.
This was tested with GCC 4.7 and clang 3.2 on Ubuntu 13.04 with a Core i3 M 350.
On a side note, the straightforward C++ code calling std::min and std::max generated the same assembly.
This is for doubles. And for int, both GCC and clang generate assembly with five instructions (not counting the ret) and no branches. Also excellent.
I don't currently use fixed-point, so I will not give an opinion on fixed-point.

Old question, but I was working on this problem today (with doubles/floats).
The best approach is to use SSE MINSS/MAXSS for floats and SSE2 MINSD/MAXSD for doubles. These are branchless and take one clock cycle each, and are easy to use thanks to compiler intrinsics. They confer more than an order of magnitude increase in performance compared with clamping with std::min/max.
You may find that surprising. I certainly did! Unfortunately VC++ 2010 uses simple comparisons for std::min/max even when /arch:SSE2 and /FP:fast are enabled. I can't speak for other compilers.
Here's the necessary code to do this in VC++:
#include <mmintrin.h>
float minss ( float a, float b )
{
// Branchless SSE min.
_mm_store_ss( &a, _mm_min_ss(_mm_set_ss(a),_mm_set_ss(b)) );
return a;
}
float maxss ( float a, float b )
{
// Branchless SSE max.
_mm_store_ss( &a, _mm_max_ss(_mm_set_ss(a),_mm_set_ss(b)) );
return a;
}
float clamp ( float val, float minval, float maxval )
{
// Branchless SSE clamp.
// return minss( maxss(val,minval), maxval );
_mm_store_ss( &val, _mm_min_ss( _mm_max_ss(_mm_set_ss(val),_mm_set_ss(minval)), _mm_set_ss(maxval) ) );
return val;
}
The double precision code is the same except with xxx_sd instead.
Edit: Initially I wrote the clamp function as commented. But looking at the assembler output I noticed that the VC++ compiler wasn't smart enough to cull the redundant move. One less instruction. :)

If your processor has a fast instruction for absolute value (as the x86 does), you can do a branchless min and max which will be faster than an if statement or ternary operation.
min(a,b) = (a + b - abs(a-b)) / 2
max(a,b) = (a + b + abs(a-b)) / 2
If one of the terms is zero (as is often the case when you're clamping) the code simplifies a bit further:
max(a,0) = (a + abs(a)) / 2
When you're combining both operations you can replace the two /2 into a single /4 or *0.25 to save a step.
The following code is over 3x faster than ternary on my Athlon II X2, when using the optimization for FMIN=0.
double clamp(double value)
{
double temp = value + FMAX - abs(value-FMAX);
#if FMIN == 0
return (temp + abs(temp)) * 0.25;
#else
return (temp + (2.0*FMIN) + abs(temp-(2.0*FMIN))) * 0.25;
#endif
}

Ternary operator is really the way to go, because most compilers are able to compile them into a native hardware operation that uses a conditional move instead of a branch (and thus avoids the mispredict penalty and pipeline bubbles and so on). Bit-manipulation is likely to cause a load-hit-store.
In particular, PPC and x86 with SSE2 have a hardware op that could be expressed as an intrinsic something like this:
double fsel( double a, double b, double c ) {
return a >= 0 ? b : c;
}
The advantage is that it does this inside the pipeline, without causing a branch. In fact, if your compiler uses the intrinsic, you can use it to implement your clamp directly:
inline double clamp ( double a, double min, double max )
{
a = fsel( a - min , a, min );
return fsel( a - max, max, a );
}
I strongly suggest you avoid bit-manipulation of doubles using integer operations. On most modern CPUs there is no direct means of moving data between double and int registers other than by taking a round trip to the dcache. This will cause a data hazard called a load-hit-store which basically empties out the CPU pipeline until the memory write has completed (usually around 40 cycles or so).
The exception to this is if the double values are already in memory and not in a register: in that case there is no danger of a load-hit-store. However your example indicates you've just calculated the double and returned it from a function which means it's likely to still be in XMM1.

For the 16.16 representation, the simple ternary is unlikely to be bettered speed-wise.
And for doubles, because you need it standard/portable C, bit-fiddling of any kind will end badly.
Even if a bit-fiddle was possible (which I doubt), you'd be relying on the binary representation of doubles. THIS (and their size) IS IMPLEMENTATION-DEPENDENT.
Possibly you could "guess" this using sizeof(double) and then comparing the layout of various double values against their common binary representations, but I think you're on a hiding to nothing.
The best rule is TELL THE COMPILER WHAT YOU WANT (ie ternary), and let it optimise for you.
EDIT: Humble pie time. I just tested quinmars idea (below), and it works - if you have IEEE-754 floats. This gave a speedup of about 20% on the code below. IObviously non-portable, but I think there may be a standardised way of asking your compiler if it uses IEEE754 float formats with a #IF...?
double FMIN = 3.13;
double FMAX = 300.44;
double FVAL[10] = {-100, 0.23, 1.24, 3.00, 3.5, 30.5, 50 ,100.22 ,200.22, 30000};
uint64 Lfmin = *(uint64 *)&FMIN;
uint64 Lfmax = *(uint64 *)&FMAX;
DWORD start = GetTickCount();
for (int j=0; j<10000000; ++j)
{
uint64 * pfvalue = (uint64 *)&FVAL[0];
for (int i=0; i<10; ++i)
*pfvalue++ = (*pfvalue < Lfmin) ? Lfmin : (*pfvalue > Lfmax) ? Lfmax : *pfvalue;
}
volatile DWORD hacktime = GetTickCount() - start;
for (int j=0; j<10000000; ++j)
{
double * pfvalue = &FVAL[0];
for (int i=0; i<10; ++i)
*pfvalue++ = (*pfvalue < FMIN) ? FMIN : (*pfvalue > FMAX) ? FMAX : *pfvalue;
}
volatile DWORD normaltime = GetTickCount() - (start + hacktime);

The bits of IEEE 754 floating point are ordered in a way that if you compare the bits interpreted as an integer you get the same results as if you would compare them as floats directly. So if you find or know a way to clamp integers you can use it for (IEEE 754) floats as well. Sorry, I don't know a faster way.
If you have the floats stored in an arrays you can consider to use some CPU extensions like SSE3, as rkj said. You can take a look at liboil it does all the dirty work for you. Keeps your program portable and uses faster cpu instructions if possible. (I'm not sure tho how OS/compiler-independent liboil is).

Rather than testing and branching, I normally use this format for clamping:
clampedA = fmin(fmax(a,MY_MIN),MY_MAX);
Although I have never done any performance analysis on the compiled code.

Realistically, no decent compiler will make a difference between an if() statement and a ?: expression. The code is simple enough that they'll be able to spot the possible paths. That said, your two examples are not identical. The equivalent code using ?: would be
a = (a > MAX) ? MAX : ((a < MIN) ? MIN : a);
as that avoid the A < MIN test when a > MAX. Now that could make a difference, as the compiler otherwise would have to spot the relation between the two tests.
If clamping is rare, you can test the need to clamp with a single test:
if (abs(a - (MAX+MIN)/2) > ((MAX-MIN)/2)) ...
E.g. with MIN=6 and MAX=10, this will first shift a down by 8, then check if it lies between -2 and +2. Whether this saves anything depends a lot on the relative cost of branching.

Here's a possibly faster implementation similar to #Roddy's answer:
typedef int64_t i_t;
typedef double f_t;
static inline
i_t i_tmin(i_t x, i_t y) {
return (y + ((x - y) & -(x < y))); // min(x, y)
}
static inline
i_t i_tmax(i_t x, i_t y) {
return (x - ((x - y) & -(x < y))); // max(x, y)
}
f_t clip_f_t(f_t f, f_t fmin, f_t fmax)
{
#ifndef TERNARY
assert(sizeof(i_t) == sizeof(f_t));
//assert(not (fmin < 0 and (f < 0 or is_negative_zero(f))));
//XXX assume IEEE-754 compliant system (lexicographically ordered floats)
//XXX break strict-aliasing rules
const i_t imin = *(i_t*)&fmin;
const i_t imax = *(i_t*)&fmax;
const i_t i = *(i_t*)&f;
const i_t iclipped = i_tmin(imax, i_tmax(i, imin));
#ifndef INT_TERNARY
return *(f_t *)&iclipped;
#else /* INT_TERNARY */
return i < imin ? fmin : (i > imax ? fmax : f);
#endif /* INT_TERNARY */
#else /* TERNARY */
return fmin > f ? fmin : (fmax < f ? fmax : f);
#endif /* TERNARY */
}
See Compute the minimum (min) or maximum (max) of two integers without branching and Comparing floating point numbers
The IEEE float and double formats were
designed so that the numbers are
“lexicographically ordered”, which –
in the words of IEEE architect William
Kahan means “if two floating-point
numbers in the same format are ordered
( say x < y ), then they are ordered
the same way when their bits are
reinterpreted as Sign-Magnitude
integers.”
A test program:
/** gcc -std=c99 -fno-strict-aliasing -O2 -lm -Wall *.c -o clip_double && clip_double */
#include <assert.h>
#include <iso646.h> // not, and
#include <math.h> // isnan()
#include <stdbool.h> // bool
#include <stdint.h> // int64_t
#include <stdio.h>
static
bool is_negative_zero(f_t x)
{
return x == 0 and 1/x < 0;
}
static inline
f_t range(f_t low, f_t f, f_t hi)
{
return fmax(low, fmin(f, hi));
}
static const f_t END = 0./0.;
#define TOSTR(f, fmin, fmax, ff) ((f) == (fmin) ? "min" : \
((f) == (fmax) ? "max" : \
(is_negative_zero(ff) ? "-0.": \
((f) == (ff) ? "f" : #f))))
static int test(f_t p[], f_t fmin, f_t fmax, f_t (*fun)(f_t, f_t, f_t))
{
assert(isnan(END));
int failed_count = 0;
for ( ; ; ++p) {
const f_t clipped = fun(*p, fmin, fmax), expected = range(fmin, *p, fmax);
if(clipped != expected and not (isnan(clipped) and isnan(expected))) {
failed_count++;
fprintf(stderr, "error: got: %s, expected: %s\t(min=%g, max=%g, f=%g)\n",
TOSTR(clipped, fmin, fmax, *p),
TOSTR(expected, fmin, fmax, *p), fmin, fmax, *p);
}
if (isnan(*p))
break;
}
return failed_count;
}
int main(void)
{
int failed_count = 0;
f_t arr[] = { -0., -1./0., 0., 1./0., 1., -1., 2,
2.1, -2.1, -0.1, END};
f_t minmax[][2] = { -1, 1, // min, max
0, 2, };
for (int i = 0; i < (sizeof(minmax) / sizeof(*minmax)); ++i)
failed_count += test(arr, minmax[i][0], minmax[i][1], clip_f_t);
return failed_count & 0xFF;
}
In console:
$ gcc -std=c99 -fno-strict-aliasing -O2 -lm *.c -o clip_double && ./clip_double
It prints:
error: got: min, expected: -0. (min=-1, max=1, f=0)
error: got: f, expected: min (min=-1, max=1, f=-1.#INF)
error: got: f, expected: min (min=-1, max=1, f=-2.1)
error: got: min, expected: f (min=-1, max=1, f=-0.1)

I tried the SSE approach to this myself, and the assembly output looked quite a bit cleaner, so I was encouraged at first, but after timing it thousands of times, it was actually quite a bit slower. It does indeed look like the VC++ compiler isn't smart enough to know what you're really intending, and it appears to move things back and forth between the XMM registers and memory when it shouldn't. That said, I don't know why the compiler isn't smart enough to use the SSE min/max instructions on the ternary operator when it seems to use SSE instructions for all floating point calculations anyway. On the other hand, if you're compiling for PowerPC, you can use the fsel intrinsic on the FP registers, and it's way faster.

As pointed out above, fmin/fmax functions work well (in gcc, with -ffast-math). Although gfortran has patterns to use IA instructions corresponding to max/min, g++ does not. In icc one must use instead std::min/max, because icc doesn't allow short-cutting the specification of how fmin/fmax work with non-finite operands.

My 2 cents in C++. Probably not any different than use ternary operators and hopefully no branching code is generated
template <typename T>
inline T clamp(T val, T lo, T hi) {
return std::max(lo, std::min(hi, val));
}

If I understand properly, you want to limit a value "a" to a range between MY_MIN and MY_MAX. The type of "a" is a double. You did not specify the type of MY_MIN or MY_MAX.
The simple expression:
clampedA = (a > MY_MAX)? MY_MAX : (a < MY_MIN)? MY_MIN : a;
should do the trick.
I think there may be a small optimization to be made if MY_MAX and MY_MIN happen to be integers:
int b = (int)a;
clampedA = (b > MY_MAX)? (double)MY_MAX : (b < MY_MIN)? (double)MY_MIN : a;
By changing to integer comparisons, it is possible you might get a slight speed advantage.

If you want to use fast absolute value instructions, check out this snipped of code I found in minicomputer, which clamps a float to the range [0,1]
clamped = 0.5*(fabs(x)-fabs(x-1.0f) + 1.0f);
(I simplified the code a bit). We can think about it as taking two values, one reflected to be >0
fabs(x)
and the other reflected about 1.0 to be <1.0
1.0-fabs(x-1.0)
And we take the average of them. If it is in range, then both values will be the same as x, so their average will again be x. If it is out of range, then one of the values will be x, and the other will be x flipped over the "boundary" point, so their average will be precisely the boundary point.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight