C code optimization - c

I'm trying to optimize some C code, and it's my first time.
As a first step i dumped my executable file in order to see the assembler code.
For example for this function:
void init_twiddle(int N)
{
int i=0;
for(i=0; i<ELEMENTS_HALF; i++)
{
twiddle_table[i].re = (float) cos((float)i * 2.0 * PI / (float)N);
twiddle_table[i].im = (float) - sin((float)i * 2.0 * PI / (float)N);
}
}
wouldn't be better if i do this instead:
void init_twiddle(int N)
{
int i=0;
float puls = 2.0 * PI / (float)N;
for(i=0; i<ELEMENTS_HALF; i++)
{
twiddle_table[i].re = (float) cos((float)i * puls);
twiddle_table[i].im = (float) - sin((float)i * puls);
}
in order to avoid mult and div operation of being repeated thousands of times?
}

Unfortunately, your first step was already kindof wrong.
Don't blindly walk through your code optimizing arbitrary loops which might or (more probably) might not affect performance (maybe because that code is so rarely called that it doesn't really use any run-time).
Optimizing means: You need to find out first where is the time spent in my program? Use timing measurements to narrow down where your program spends most of its time (you can use homegrown logging using us timers or a profiling application for that). Without at least some figures you wouldn't even see where the compiler has already helped you and maybe has already maxed out all possibilities, even if your code looks like it has some potential left for being faster (modern compilers are really very good at that).
Only if you know the hot spots in your application you should start optimizing those.

The problem is that it is a floating point expression and floating point operations are not commutative. So the optimization is invalid in general for any compiler that follows IEEE 754. So either you have to do this optimization manually, or you have to tell the compiler to treat floating point as commutative for optimization purposes (in gcc and clang you use -ffast-math to do this). This will introduce slight changes in the resulting values.
For comparison of the assembly:
Without -ffast-math
With -ffast-math

You can do this much faster, indeed you need only 1 sine and 1 cosine (which are disastrously slow). What you're actually doing is calculating the coordinates of a little vector that you spin around the origin, the alternative way to do it is by actually spinning that vector around the origin, one step at the time. The rotation matrix for a single step is what costs the single sine and cosine.
Of course this may be a bit less accurate, but no significant trouble should build up if you make a a reasonable number of steps.

The root of all evil is premature optimization
– Donald Knuth
You should optimize, if you have a problem with execution duration. There are tools that record the duration of every single statement or at least function call.
I think that most compilers detect such constant expressions in a loop and there is nothing to optimize, because it is already optimized.

First of all, use double, not float. In C, library routines are all in double, so you're just doing a lot of converting.
Second, calculate the angle once and put it in a variable, not twice.
Maybe the compiler recognizes that it can do this for you, but I prefer not to tempt the compiler not to.
Third, is there a sincos function? The sine and cosine functions are closely related, so one can be calculated at the same time as the other.
Fourth, when thinking about performance, switch your brain to thinking in percent of total time, not doing something "thousands of times". That way, you will concentrate on what has the greatest overall benefit, not things that might well be irrelevant.

This probably won't change your code performance, since this a standard loop invariants optimization that is performed by any standard compiler (assuming optimizations aren't turned off)..

Related

What optimizations should be left for compiler?

Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:
v[i*3+0], v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:
inline double sqr(double x)
{
return x*x;
}
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
return 0.5 * mass * sum;
}
To reduce the number of multiplications, it can be written as:
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
{
double *w = v + i*3;
sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
}
return 0.5 * mass * sum;
}
(one may write a function with even fewer multiplications, but that's not the point of this question)
Now my question is: Since many C compilers can do this kind of optimizations automatically, where should the developer rely on the compiler and where should she/he try to do some optimization manually?
where should the developer rely on the compiler and where should she/he try to do some optimization manually?
Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If no, forget about manual optimizations.
Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.
When I found this bottleneck, how exactly did I benchmark it and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking. Including: benchmarking with compiler optimizations disabled...
From there on you can start looking at system-specific things as well as the algorithms themselves - there's far too many things to look at to cover in an SO answer. It's a huge difference between optimizing code for a low-end microcontroller and a 64-bit desktop PC (and everything in between).
One thing that looks a bit like premature optimization, but could just be ignorance of language abilities is that you have all of the information to describe particles flattened into an array of double values.
I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three datapoints on each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.
This will be much easier for you than having to pass three times the number of particles arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors.
Looking at how both gcc and clang handle your code, the micro optimisation you contemplate is vain. The compilers already apply standard common subexpression elimination techniques that remove to overhead you are trying to eliminate.
As a matter of fact, the code generated handles 2 components at a time using XMM registers.
If performance is a must, then here are steps that will save the day:
the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.
if you have a profiler, use it to determine where are the bottlenecks if any. Changing algorithms for those parts that appear to hog performance is an effective approach.
try and get the best from the compiler: study the optimization options and try and let the compiler use more aggressive techniques if they are appropriate for the target system. For example -mavx512f -mavx512cd let the gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.
This is a non intrusive technique as the code does not change, so you don't risk introducing new bugs by hand optimizing the code.
Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.
Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):
double get_kinetic_energy(const double v[], int n, double mass)
{
double sum = 0.0;
for (int i = 0; i < 3 * n; i++)
sum += v[i] * v[i];
return 0.5 * mass * sum;
}
Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.
They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.
Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.
For example, given
void test(char *p)
{
char *e = p+5;
do
{
p[0] = p[1];
p++;
}while(p < e);
}
when targeting the Cortex-M0 with an optimization level below -O2, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 7);. While it would be theoretically possible that a library implementation of memmove might optimize the n==7 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it would seem far more likely that any plausible implementations would spend some time analyzing what needs to be done, and then after doing that spend just as long executing the loop as would code generated using -O0.
What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.

What is 'differential timing' technique for doing benchmarks?

At around 39 minute of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube)
there is a slide of how to use differential timing... can someone show me some basic code with this approach? It was only mentioned for a second, but I think that's an interesting idea.
Run baseline 2n times (t2a)
vs. baseline n times + contender n times (ta+b).
Relative improvement = "t2a / (2ta+b - t2a)"
some overhead noises canceled
Alexsandrescu's slide is rather trivial to pour into code:
auto start = clock::now();
for( int i = 0; i < 2*n; i++ )
baseline();
auto t2a = clock::now() - start;
start = clock::now();
for( int i = 0; i < n; i++ )
baseline();
// *
for( int i = 0; i < n; i++ )
contender();
auto taplusb = clock::now() - start;
double r = t2a / (2 * taplusb - t2a) // relative speedup
* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speed up this way as opposed to simply tBaseline / tContender as I've been doing for ever. He only vaguely hints at '...overhead noise (being) cancelled (out)', but doesn't explain it in detail.
If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing that 2n baseline vs. n baseline + n contender in separate invocations of the benchmark program.
So instead of doing some warmup runs before the timed region, he's using the baseline as a controlled warmup inside the timed region. This might make it possible to just time the whole program, e.g. perf stat, instead of calling a time function inside the program. Depending on how much process startup overhead your OS has vs. how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as the timed region will warm up branch-prediction for it. Or not because after inlining the warm-up vs. main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
Take the rest of his advice with a grain of salt, too. Most of it use useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays for cache-footprint reasons), but the reasoning is wrong for some of them.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit int in a vector for paddd or 2x 64-bit int in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64,r64. I don't think it was true even on Pentium 4 (Nocona) despite P4 having funky double-pumped ALUs with 0.5 cycle latency for add. At least before Prescott/Nocona which introduced 64-bit support.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. Gcc/clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements, when processing an array of int8_t or uint8_t. I forget if I've every actually worked around that by using narrow locals, though. C default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That's never been true on x86-64: FP math uses SSE/SSE2 which has an instruction that does truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE includes cvttsd2si exactly for that common case.
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively doing (2 * baseline) / (2*contender) = baseline/contender.
It assumes that the times add normally (not overlapping computation). t_2a = 2 * baseline, and 2 * t_ab = 2 * baseline + 2 * contender. Subtracting cancels the 2*baseline parts, leaving you with 2*contender.
The trick isn't in the math, if anything this is more mathematically dangerous because subtracting two larger numbers accumulates error. i.e. if the n*baseline actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.

How does an ahead-of-time compiler determine how far to go with optimization?

Optimization in modern compilers is getting better and better with basic optimizations like constant folding to utilizing SIMD instructions. However, I wonder how far these kind of optimizations should be taken and how this decision is made by compilers nowadays.
Let's look at an example:
#include <stdio.h>
static double naive_sin(double n) {
return n - n*n*n / 6.0 + n*n*n*n*n / 120.0 + n*n*n*n*n*n*n / 5040.0;
}
int main() {
printf("%f\n", naive_sin(1.0));
return 0;
}
When compiling this with GCC with -O3, it can be observed that the resulting floating point number is calculated by the compiler and stored in the source code. Optimizing further than that is obviously not possible.
Now, let's look at a second example:
#include <stdio.h>
int main() {
double start = 0.0;
for (int i = 0; i < 100; i++) {
start += 1.0;
}
printf("%f\n", start);
return 0;
}
With the result of the first example in mind, one could expect the compiler to apply similar optimization and produce the constant 100.0 in the resulting machine code. However, when looking at the output, it turns out that the loop is still there!
Obviously this kind of optimization is not always possible. Let's say you were writing a program that calculates pi to a million places. Such a program requires no user input, so theoretically the result could be hardcoded into the machine code by the compiler. Of course this is not a good idea, because the compiler will take much longer to internally evaluate a program like that as opposed to just running the less optimized version.
Still, what makes the compiler decide to not optimize the loop in this case? Are there languages/compilers that optimize this kind of code or is there something preventing this? Is it perhaps related to the concept of not being able to predict if a program is ever going to end?
It's really just a question of which optimizations are enabled, and what optimizations are actually available in your compiler. Some optimizations, like function inlining and constant propagation, are basically universally available and relatively easy to implement. So, the majority of compilers will optimize the first program with most optimization settings.
The second program requires loop analysis and loop elimination to optimize, which is much trickier to do. A compiler probably could optimize the second program, but your compiler most likely doesn't have the mechanisms for optimizing such a loop (proving the correctness of float optimizations is often a lot trickier than proving the correctness of integer optimizations). Note that my version of GCC does optimize the loop if start is declared as an int.

Performance of math functions?

I'm working with graphing accelerometer data here, and I'm trying to correct for gravity. To do this, I get the acceleration vector in spherical coordinates, decrease the radius by 1g, and convert back to cartesian. This method is called on a timer every 0.03 seconds:
//poll accleration
ThreeAxisAcceleration current = self.accelerationData;
//math to correct for gravity:
float radius = sqrt(pow(current.x, 2) + pow(current.y, 2) + pow(current.z, 2));
float theta = atan2(current.y, current.x);
float phi = acos(current.z/radius);
//NSLog(#"SPHERICAL--- RADIUS: %.2f -- THETA: %.2f -- PHI: %.2f", radius, theta, phi);
radius = radius - 1.0;
float newX = radius * cos(theta) * sin(phi);
float newY = radius * sin(theta) * sin(phi);
float newZ = radius * cos(phi);
current = (ThreeAxisAcceleration){newX, newY, newZ};
//end math
NSValue *arrayVal = [NSValue value:&current withObjCType:#encode(ThreeAxisAcceleration)];
if ([_dataHistoryBuffer count] > self.bounds.size.width) {
[_dataHistoryBuffer removeObjectAtIndex:0];
}
[_dataHistoryBuffer addObject:arrayVal];
[self setNeedsDisplay];
Somehow, the addition of the gravity correction is gradually slowing my code horrendously. I find it hard to believe that this amount of math can slow down the program, but yet without it it can still run through my entire display method which is quite lengthy. Are there any options I can consider here to avoid this? Am I missing something or is the math just that slow? I can comment out between the //math and //end math tags and be just fine.
Thanks for any help.
P.S. incase it may matter, to whom it may interest, I'm programming in cocoa, and this method belongs to a subclass of CALayer, with -drawInContext: implemented.
Are you on iPhone? Try using the float variants of these functions: powf, sqrtf, etc
There's more info in point #4 of Kendall Helmstetter Gelner's answer to this SO question.
Besides the fact that it's theoretically impossible to simply factor out Earth's gravity, the first step I would take would be to benchmark each of the operations that you're performing (multiplication, division, sin, atan2, etc) and then engineer a way around the operations that take significantly longer to compute (or avoid computing the problematic operations). Make sure to use the same data types in your benchmarking as you will in your finished product.
This is a classic example of the time/accuracy trade-off. There are usually multiple algorithms for performing the same computation and you also have LUTs/interpolation at your disposal.
I ran into the same issues when I made my own Wii-style remote controller. If you identify the expensive operation and are having trouble engineering around it then start another question. :)
The normal way to shorten a vector would be along the lines of:
float originalMagnitude = sqrtf(current.x * current.x, current.y * current.y, current.z* current.z);
float desiredMagnitude = originalMagnitude - 1.0f;
float scaleFactor = (originalMagnitude != 0) ? desiredMagnitude / originalMagnitude : 0.0f; // avoid divide-by-zero
current.x *= scaleFactor;
current.y *= scaleFactor;
current.z *= scaleFactor;
That said, no, calling a few trig functions 33 times a second shouldn’t be slowing you down much. On the other hand, -[NSMutableArray removeObjectAtIndex:] could potentially be slow for a big array. A ring buffer (either using NSMutableArray or a C array of structs) would be more efficient.
Profile, don't speculate. Don't change a damn thing until you know what to change.
Assuming that you get a profile that shows that all the math really is slowing you down:
Don't ever write pow(someFloat,2). The compiler should be able to optimize this away for you, but often times, on newer hardware, those optimizations may not yet be in place. This should always be written someFloat*someFloat. The pow( ) function is generally the most expensive function in the math library. Simple multiplication will always be at least as fast as calling pow( ), and will always be at least as accurate (assuming IEEE-754 compliant arithmetic). Plus, it's easier for the compiler to optimize.
When working with floats in C, use the suffixed forms of the math library function. sinf is faster than sin. sqrtf is faster than sqrt. Beyond the functions themselves being faster, you avoid unnecessary conversions to and from double.
If you're seeing the slowdown on a ARMv6 processor (not the 3GS or the new iPod Touch), make sure you are not compiling to thumb code when you are doing a lot of floating-point computation. The thumb instruction set (prior to thumb2) cannot access the VFP registers, and thus needs a shim for every floating point operation. This can be quite expensive.
If you just want to decrease the length of the acceleration vector by 1.0 (hint: this doesn't do what you want), there are more efficient algorithms to do so.
Those math lines look fine. I don't know enough Objective C to know what the current = ... line is doing though. Could it be allocating memory on the heap which isn't being reclaimed?
What happens if you just comment it out? Have you watched the processes' execution with top, to see if it starts slurping more CPU or memory?
Other than the other commentor's use of the floating point (as opposed to double operators), doing all that _dataHistoryBuffer stuff will be what's killing your app. That'll churn up the memory like there's no tomorrow, and since you are using the NSValue, then all those objects will be added to the autorelease pool making memory consumption spike. You're much better off avoiding keeping a list of values unless you really, really need it - and if so, figuring out a more appropriate (read: fixed size, non-object) mechanism to store them in. Even a circular buffer of structs (e.g. an array of 10 structs, and then have a counter which does i++ % 10 to index into it) would be better.
Profile it to see exactly where the problem is. If necessary, comment out a subset of the "math" part at a time. Performance is something people usually guess wrong, even smart, thoughtful, experienced people.
Just out of interest - do you know how the Math SQRT function is implemented? If it is using an inefficient approximation algorithm, then it might be your culprit. Your best option is to create some sort of test harness that can get an average performance for each of the instructions that you are using.
Another question - does increasing or reducing the precision of the operators (i.e. by using double value floats rather than singles) change the performance in any way?
As others have said, you should profile to be sure. Having said that, yes, it is quite likely that adding the extra calculations did slow it down.
By default, all code for iPhone is compiled for the Thumb-1 instruction set. Thumb-1 does not have native floating point support, so it ends up calling out to a SOFTWARE floating point implementation. There are 2 ways to handle this:
Compile the code for ARM. The processor in the iPhone can freely intermix Thumb and ARM code, so you can just compile the the necessary pieces as ARM. You should note that GCC (and by proxy Xcode) cannot compile an individual function as ARM, you will need to isolate all the relevent code into its on compilation unit. It is probably easiest just to set the entire project to compile for ARM to see if it fixes things (Uncheck "Build Options" > "Compile for Thumb"). You should note that while ARM will speed up floating point, it reduces instruction density thereby hurting cache efficiency and degrading all of your other code, so try to avoid it where you can.
Compile for Thumb-2. Thumb-2 is an enhanced version of Thumb that adds support for some floating point operations. It is only available on iPhone 3GS and the new iPod Touch, so this may not be an option for you. You can do that by switching your architecture to "Optimized," which will build a fat binary with current slow version for older devices, and the faster version for ones that support.
You can also combine both of these options, if that seems like the best choice.
Unless I misunderstand your code, you basically scale your point by some factor.
I think the following code should be equivalent to what you do.
double radius = sqrt(current.x * current.x
+ current.y * current.y
+ current.z * current.z);
double newRadius = radius - 1.0;
double scale = newRadius/radius;
current.x *= scale;
current.y *= scale;
current.z *= scale;
This method will find out what the problem is. The worse your slowdown is, the quicker it will find it. Guesses are things that you suspect but don't know, such as thinking the math is the problem. Guesses are usually wrong, at least to begin with. If you are right, the samples will show you. If you are wrong, they will show you what is right. It never misses.
My guess is since you're using autoreleased memory (for NSValue) every 0.03 seconds you're probably not giving the pool much time to release itself. I could be wrong - profiling is the only way to tell.
Try manually allocating and releasing the NSValue and see if it makes a difference.

Is it still worth trying to create optimizations for sqrt() in C?

Are the old tricks (lookup table, approx functions) for creating faster implementations of sqrt() still useful, or is the default implementation as fast as it is going to get with modern compilers and hardware?
Rule 1: profile before optimizing
Before investing any effort in the belief that you can beat the optimizer, you must profile everything and discover where the bottleneck really lies. In general, it is unlikely that sqrt() itself is your bottleneck.
Rule 2: replace the algorithm before replacing a standard function
Even if sqrt() is the bottleneck, then it is still reasonably likely that there are algorithmic approaches (such as sorting distances by length squared which is easily computed without a call to any math function) that can eliminate the need to call sqrt() in the first place.
What the compiler does for you if you do nothing else
Many modern C compilers are willing to inline CRT functions at higher optimization levels, making the natural expression including calls to sqrt() as fast as it needs to be.
In particular, I checked MinGW gcc v3.4.5 and it replaced a call to sqrt() with inline code that shuffled the FPU state and at the core used the FSQRT instruction. Thanks to the way that the C standard interacts with IEEE 754 floating point, it did have to follow the FSQRT with some code to check for exceptional conditions and a call to the real sqrt() function from the runtime library so that floating point exceptions can be handled by the library as required by the standard.
With sqrt() inline and used in the context of a larger all double expression, the result is as efficient as possible given the constraints of of standards compliance and preservation of full precision.
For this (very common) combination of compiler and target platform and given no knowledge of the use case, this result is pretty good, and the code is clear and maintainable.
In practice, any tricks will make the code less clear, and likely less maintainable. After all, would you rather maintain (-b + sqrt(b*b - 4.*a*c)) / (2*a) or an opaque block of inline assembly and tables?
Also, in practice, you can generally count on the compiler and library authors to take good advantage of your platform's capabilities, and usually to know more than you do about the subtleties of optimizations.
However, on rare occasions, it is possible to do better.
One such occasion is in calculations where you know how much precision you really need and also know that you aren't depending on the the C standard's floating point exception handling and can get along with what the hardware platform supplies instead.
Edit: I rearranged the text a bit to put emphasis on profiling and algorithms as suggested by Jonathan Leffler in comments. Thanks, Jonathan.
Edit2: Fixed precedence typo in the quadratic example spotted by kmm's sharp eyes.
Sqrt is basically unchanged on most systems. It's a relatively slow operation, but the total system speeds have improved, so it may not be worth trying to use "tricks".
The decision to optimize it with approximations for the (minor) gains this can achieve are really up to you. Modern hardware has eliminated some of the need for these types of sacrifices (speed vs. precision), but in certain situations, this is still valuable.
I'd use profiling to determine whether this is "still useful".
If you have proven that the call to sqrt() in your code is a bottleneck with a profiler then it may be worth trying to create an optimizated version. Otherwise it's a waste of time.
This probably is the fastest method of computing the square root:
float fastsqrt(float val) {
union
{
int tmp;
float val;
} u;
u.val = val;
u.tmp -= 1<<23; /* Remove last bit so 1.0 gives 1.0 */
/* tmp is now an approximation to logbase2(val) */
u.tmp >>= 1; /* divide by 2 */
u.tmp += 1<<29; /* add 64 to exponent: (e+127)/2 =(e/2)+63, */
/* that represents (e/2)-64 but we want e/2 */
return u.val;
}
wikipedia article
This probably is the fastest method of computing the inverse square root. Assume at most 0.00175228 error.
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1);
x = *(float*)&i;
return x*(1.5f - xhalf*x*x);
}
This is (very roughly) about 4 times faster than (float)(1.0/sqrt(x))
wikipedia article
It is generally safe to assume that the standard library developers are quite clever, and have written performant code. You're unlikely to be able to match them in general.
So the question becomes, do you know something that'll let you do a better job? I'm not asking about special algorithms for computing the square root (the standard library developers knows of these too, and if they were worthwhile in general, they'd have used them already), but do you have any specific information about your use case, that changes the situation?
Do you only need limited precision? If so, you can speed it up compared to the standard library version, which has to be accurate.
Or do you know that your application will always run on a specific type of CPU? Then you can look at how efficient that CPU's sqrt instruction is, and see if there are better alternatives. Of course, the downside to this is that if I run your app on another CPU, your code might turn out slower than the standard sqrt().
Can you make assumptions in your code, that the standard library developers couldn't?
You're unlikely to be able to come up with a better solution to the problem "implement an efficient replacement for the standard library sqrt".
But you might be able to come up with a solution to the problem "implement an efficient square root function for this specific situation".
Why not? You probably learn a lot!
I find it very hard to believe that the sqrt function is your application's bottleneck because of the way modern computers are designed. Assuming this isn't a question in reference to some crazy low end processor, you take a tremendous speed hit to access memory outside of your CPU caches, so unless you're algorithm is doing math on a very few numbers (enough that they all basically fit within the L1 and L2 caches) you're not going to notice any speed up from optimizing any of your arithmetic.
I still find it useful even now, though this is the context of normalizing a million+ vectors every frame in response to deforming meshes.
That said, I'm generally not creating my own optimizations but relying on a crude approximation of inverse square root provided as a SIMD instruction: rsqrtps. That is still really useful in speeding up some real-world cases if you're willing to sacrifice precision for speed. Using rsqrtps can actually reduce the entirety of the operation which includes deforming and normalizing vertex normals to almost half the time, but at the cost of the precision of the results (that said, in ways that can barely be noticed by the human eye).
I've also still found the fast inverse sqrt as often credited incorrectly to John Carmack to still improve performance in scalar cases, though I don't use it much nowadays. It's generally natural to get some speed boost if you're willing to sacrifice accuracy. That said, I wouldn't even attempt to beat C's sqrt if you aren't trying to sacrifice precision for speed.
You generally have to sacrifice the generality of the solution (like its precision) if you want to beat standard implementations, and that tends to apply whether it's a mathematical function or, say, malloc. I can easily beat malloc with a narrowly-applicable free list lacking thread-safety that's suitable for very specific contexts. It's another thing to beat it with a general-purpose allocator which can allocate variable-sized chunks of memory and free any one of them at any given time.

Resources