I am trying to find the gcd of two numbers using two approach one is substraction
int gcd2(int a, int b)
{
if (a == 0)
return b;
else
printf("Iteration\n");
if(a>=b)
a=a-b;
else
b=b-a;
return gcd2(min(a,b),max(a,b));
}
and the other one is by modulo operation
int gcd1(int a, int b)
{
if (a == 0)
return b;
else
printf("Iteration\n");
return gcd1(b%a, a);
}
i know that the number of iteration in gcd2 is more then gcd1 but in gcd1 i am using modulo operation which is also costly so i wanted to know are these two approach same in terms of run time .
Knuth covers gcd extensively in Volume 2 of "The Art of Computer Programming" section 4.5.2
His version of binary gcd is more sophisticated and uses these facts:
a) if u and v are even, gcd(u, v)=2gcd(u/2, v/2)
b) if u is even and v is odd, gcd(u, v) = gcd(u/2, v)
c) As in Euclid's algorithm, gcd(u, v) = gcd(u-v, v)
d) if u and v are both odd, then u-v is even and |u-v| < max(u, v).
For his model computer MIX, binary gcd is about 20% faster than Euclidean gcd. Your mileage may vary.
For most practical purposes the version that uses the modulo operator should be faster, for two reasons:
(1) the subtraction-based version has to iterate more often and thus incurs more branch mis-predictions
(2) the modulo-based approach (and the binary version) are less vulnerable to the performance hitch mentioned by Henry, which occurs when one operand is much larger than the other
Also, the modulo-based and the shift-based version shift more work into specialised circuitry inside the processor, which gives them even more of an edge. The binary version is more complex to code; it can easily be made faster in languages that are close to the metal, like C/C++ or assembler, but not so easily in languages that are further away from the coal face. One reason is that C/C++ compilers can avoid some branches by employing conditional move instructions (e.g. CMOV) or at least allow equivalent bit trickery; compilers for other languages tend to lag behind C/C++ unless they share the same backend (as with the gnu compilers).
Things like python are a different story altogether, since the interpreter overhead dwarfs the instruction-level timing differences between additive ops, shift ops and mul/div ops, or the cost of branching (and branch mis-predictions).
That's a general view on things; it is always possible to construct pathological inputs that can prove any version to be inferior or superior.
In any case, Dijkstra's dictum regarding premature optimisation should be taken seriously. Barring proof to the contrary, the best code is the code that is so simple and clear that it is difficult to get wrong. Stick with the modulo version until you have proof that it's too slow. If and when you do have such proof, come back and we'll speed things up (since then there will be specifics known, specifics that can be leveraged).
Related
Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:
v[i*3+0], v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:
inline double sqr(double x)
{
return x*x;
}
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
return 0.5 * mass * sum;
}
To reduce the number of multiplications, it can be written as:
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
{
double *w = v + i*3;
sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
}
return 0.5 * mass * sum;
}
(one may write a function with even fewer multiplications, but that's not the point of this question)
Now my question is: Since many C compilers can do this kind of optimizations automatically, where should the developer rely on the compiler and where should she/he try to do some optimization manually?
where should the developer rely on the compiler and where should she/he try to do some optimization manually?
Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If no, forget about manual optimizations.
Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.
When I found this bottleneck, how exactly did I benchmark it and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking. Including: benchmarking with compiler optimizations disabled...
From there on you can start looking at system-specific things as well as the algorithms themselves - there's far too many things to look at to cover in an SO answer. It's a huge difference between optimizing code for a low-end microcontroller and a 64-bit desktop PC (and everything in between).
One thing that looks a bit like premature optimization, but could just be ignorance of language abilities is that you have all of the information to describe particles flattened into an array of double values.
I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three datapoints on each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.
This will be much easier for you than having to pass three times the number of particles arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors.
Looking at how both gcc and clang handle your code, the micro optimisation you contemplate is vain. The compilers already apply standard common subexpression elimination techniques that remove to overhead you are trying to eliminate.
As a matter of fact, the code generated handles 2 components at a time using XMM registers.
If performance is a must, then here are steps that will save the day:
the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.
if you have a profiler, use it to determine where are the bottlenecks if any. Changing algorithms for those parts that appear to hog performance is an effective approach.
try and get the best from the compiler: study the optimization options and try and let the compiler use more aggressive techniques if they are appropriate for the target system. For example -mavx512f -mavx512cd let the gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.
This is a non intrusive technique as the code does not change, so you don't risk introducing new bugs by hand optimizing the code.
Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.
Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):
double get_kinetic_energy(const double v[], int n, double mass)
{
double sum = 0.0;
for (int i = 0; i < 3 * n; i++)
sum += v[i] * v[i];
return 0.5 * mass * sum;
}
Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.
They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.
Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.
For example, given
void test(char *p)
{
char *e = p+5;
do
{
p[0] = p[1];
p++;
}while(p < e);
}
when targeting the Cortex-M0 with an optimization level below -O2, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 7);. While it would be theoretically possible that a library implementation of memmove might optimize the n==7 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it would seem far more likely that any plausible implementations would spend some time analyzing what needs to be done, and then after doing that spend just as long executing the loop as would code generated using -O0.
What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.
In my program I have a great presence of the operation n % 10. I know that the module operation can be done much faster when we have n% m where m is the power of 2, since it can be replaced by n & (m-1 ), however Is there any faster way to calculate modulus if the operand is 10?
In my case n is a uint8_t in some cases and in other cases n is an uint32_t.
Because most modern processors can do multiplication much, much faster than division, it is often possible to speed up division and modulus operations where the dividend is a known small constant by replacing the division with one or two multiplications and a few other fast operations (such as shift and addition).
To do so requires computing at compile-time some magic numbers dependent on the dividend; fortunately most modern compilers know how to do this so you don't need to do anything to take advantage. Just let your compiler do the heavy lifting for you, as #chux suggests in an excellent answer.
You can help the compiler by using unsigned types; for some dividends, signed division and modulus are harder to replace.
The basic outline of the optimisation of modulus looks like this:
If you had exact arithmetic, you could replace x % p with p * ((x * (1/p)) % 1). For constant p, 1/p can be precomputed at compile time. The %1 operation simply consists of discarding the fraction part, which is just a right-shift. So that replaces a division with two multiplies, and if p only has a few bits set, the multiply by p might be further optimised into a few left-shifts.
We can do that computation with fixed-point arithmetic, taking advantage of the fact that most processors produce a double-sized result for integer multiplication. Since we don't care about the integer part of the inner multiplication and we know that the result of the outer multiplication must be less than p, we only need to reserve ceil(log2 p) bits for the integer part of the computation, leaving the rest of the bits for the fraction. And that might give us enough precision to correctly handle the possible range of values of x, particularly if x has a limited range (eg. uint8_t or even uint16_t). The key is finding a position of the fixed-point which minimises the error in representation of 1/p.
For many small values of p, that works. For others, there is an alternative (but slower) solution which involves estimating q = x/p using multiplication by the inverse, and then computing x - q * p. If the estimate of q can be guaranteed to be either correct or off by one in a known direction, we only need to correct the final computation by conditionally adding or subtracting p; that can be accomplished without a branch on many modern CPUs. (The direction of the error is known because it will depend only on whether the approximation we chose for the inverse of the dividend was too small or too big.)
In the very specific case of x % 10 where x is a uint_8, you might be able to do better than the above using a 256-byte lookup table. That would only be worthwhile if you were doing the modulus operation in a tight loop over a large number of values, and even then you'd want to profile carefully to verify that it is an improvement.
I doubt whether that's the best expenditure of your time; there are probably much more fruitful optimisation opportunities in your application.
however Is there any faster way to calculate modulus if the operand is 10?
With a good compiler, no. The compiler would have already emitted good code. You can explore different optimization settings with the compiler.
OTOH, if you know of some restrictions that the compiler cannot assume with n % 10, like values are always positive or of a sub-range, you might be able to out optimize the compiler.
Such micro-optimisation is usually not efficient use of programmer's time.
Due to the advances in x86 C compilers (namely GCC and Clang), many coding practices that were believed to improve efficiency are no longer used since the compilers can do a better job optimizing the code than humans (e.g. bit shift vs. multiplication).
Which specific practices are these?
Of the optimizations which are commonly recommended, a couple which are basically never fruitful given modern compilers include:
Mathematical transformations
Modern compilers understand mathematics, and will perform transformations on mathematical expressions where appropriate.
Optimizations such as conversion of multiplication to addition, or constant multiplication or division to bit shifting, are already performed by modern compilers, even at low optimization levels. Examples of these optimizations include:
x * 2 -> x + x
x * 2 -> x << 1
Note that some specific cases may differ. For instance, x >> 1 is not the same as x / 2; it is not appropriate to substitute one for the other!
Additionally, many of these suggested optimizations aren't actually any faster than the code they replace.
Stupid code tricks
I'm not even sure what to call this, but tricks like XOR swapping (a ^= b; b ^= a; a ^= b;) are not optimizations at all. They're just party tricks — they are slower, and more fragile, than the obvious approach. Don't use them.
The register keyword
This keyword is ignored by many modern compilers, as it its intended meaning (forcing a variable to be stored in a register) is not meaningful given current register allocation algorithms.
Code transformations
Compilers will automatically perform a wide variety of code transformations where appropriate. Several such transformations which are frequently recommended for manual application, but which are rarely useful when applied thus, include:
Loop unrolling. (This is often actually harmful when applied indiscriminately, as it bloats code size.)
Function inlining. (Tag a function as static, and it'll usually be inlined where appropriate when optimization is enabled.)
One such practice is to avoid multiplications by using arrays of array pointers instead of real 2D arrays.
Old practice:
int width = 1234, height = 5678;
int* buffer = malloc(width*height*sizeof(*buffer));
int** image = malloc(height*sizeof(*image));
for(int i = height; i--; ) image[i] = &buffer[i*width];
//Now do some heavy computations with image[y][x].
This used to be faster, because multiplications used to be very expensive (on the order of 30 CPU cycles), while memory accesses were virtually free (it was only in the 1990s that caches were added because memory couldn't keep up with full CPU speed).
But multiplications became fast, some CPUs being able to do them in one CPU cycle, while memory accesses did not keep pace at all. So, now this code is likely to be more performant:
int width = 1234, height = 5678;
int (*image)[width] = malloc(height*sizeof(*image));
//Now do some heavy computations with image[y][x],
//which will invoke pointer arithmetic to calculate the offset as (y*width + x)*sizeof(int).
Currently, there are still some CPUs around, where the second code is not faster, but the big multiplication penalty is not with us anymore.
Due to the plurality of platforms you would at best optimize for a given platform (or CPU Architecture/Model) and compiler!! If your code is running on many platforms it's a waste of time. (I'm talking about micro opts, it's always worth considering better algorithms)
This said optimizing for a given platform, DSP makes sense if the need for it arises. Then the best first helper is IMHO the judicious use of restrict keyword if the compiler/optimizer supports it well. Avoid algorithms involving conditions and jumpy code (breaks, goto, if, while, ...) This favors streaming and avoids too many bad branch predictions. I would agree these hints are common sense now.
Generally speaking I would say: Any manipulation that modifies the code by making assumptions on how the compiler optimizes shall be IMHO avoided at all.
Rather, then switch to assembly (common practice for some really important algorithms in DSPs where the compilers while being really great still miss the last few % of CPU/Mem cycles performance increase...)
One optimization that really shouldn't be used much more is #define (expanding on duskwuff's answer a bit).
The C preprocessor is a wonderful thing, and it can do some amazing code transformations, and it can make certain really complex code much simpler — but using #define just to cause a small operation to be inlined isn't usually appropriate anymore. Most modern compilers have a real inline keyword (or equivalent, like __inline__), and they're smart enough to inline most static functions anyway, which means that code like this:
#define sum(x, y) ((x) + (y))
is really better written as the equivalent function:
static int sum(int x, int y)
{
return x + y;
}
You avoid dangerous multiple-evaluation problems and side-effects, you get compiler type-checking and you end up with cleaner code, too. If it's worth inlining, the compiler will do it.
In general, save the preprocessor for the circumstances where it's needed: Emitting a lot of complex, variant code or partial code quickly. Using the preprocessor for inlining small functions and defining constants is mostly an antipattern now.
While I was reading tips in C, I have seen this tip here http://www.cprogramming.com/tips/tip/multiply-rather-than-divide
but I am not sure. I was told both multiply and divide are slower and time consuming and requires many cycles.
and I have seen people often use i << 2 instead of i x 4 since shifting is faster.
Is it a good tip using x0.5 or /2 ? or however modern compilers do optimize it in a better way?
It's true that some (if not most) processors can multiply faster than performing a division operation, but, it's like the myth of ++i being faster than i++ in a for loop. Yes, it once was, but nowadays, compilers are smart enough to optimize all those things for you, so you should not care about this anymore.
And about bit-shifting, it once was faster to shift << 2 than to multiply by 4, but those days are over as most processors can multiply in one clock cycle, just like a shift operation.
A great example of this was the calculation of the pixel address in VGA 320x240 mode. They all did this:
address = x + (y << 8) + (y << 6)
to multiply y with 320. On modern processors, this can be slower than just doing:
address = x + y * 320;
So, just write what you think and the compiler will do the rest :)
I find that this service is invaluable for testing this sort of stuff:
http://gcc.godbolt.org/
Just look at the final assembly. 99% of the time, you will see that the compiler optimises it all to the same code anyway. Don't waste the brain power!
In some cases, it is better to write it explicitly. For example, 2^n (where n is a positive integer) could be written as (int) pow( 2.0, n ) but it is obviously better to use 1<<n (and the compiler won't make that optimisation for you). So it can be worth keeping these things in the back of your mind. As with anything though, don't optimise prematurely.
"multiply by 0.5 rather than divide by 2" (2.0) is faster on fewer environments these days than before, primarily due to improved compilers that will optimize the code.
"use i << 2 instead of i x 4" is faster in fewer environments for similar reasons.
In select cases, the programmer still needs to attend to such issues, but it is increasingly rare. Code maintenance continues to grow as a dominate issue. So use what makes the most sense for that code snippet: x*0.5, x/2.0, half(x), etc.
Compilers readily optimize code. Recommend you code with high level issues in mind. E. g. Is the algorithm O(n) or O(n*n)?
The important thought to pass on is that best code design practices evolve and variations occur amongst environments. Be adaptable. What is best today may shift (or multiply) in the future.
Many CPUs can perform multiplication in 1 or 2 clock cycles but division always takes longer (although FP division is sometimes faster than integer division).
If you look at this answer How can I compare the performance of log() and fp division in C++? you will see that division can exceed 24 cycles.
Why does division take so much longer than multiplication? If you remember back to grade school, you may recall that multiplication can essentially be performed with many simultaneous additions. Division requires iterative subtraction that cannot be performed simultaneously so it takes longer. In fact, some FP units speed up division by performing a reciprocal approximation and multiplying by that. It isn't quite as accurate but is somewhat faster.
If you are working with integers, and you expect to get an integer as result, it's better to use / 2, this way avoids unnecesary conversions to/from float
Are the old tricks (lookup table, approx functions) for creating faster implementations of sqrt() still useful, or is the default implementation as fast as it is going to get with modern compilers and hardware?
Rule 1: profile before optimizing
Before investing any effort in the belief that you can beat the optimizer, you must profile everything and discover where the bottleneck really lies. In general, it is unlikely that sqrt() itself is your bottleneck.
Rule 2: replace the algorithm before replacing a standard function
Even if sqrt() is the bottleneck, then it is still reasonably likely that there are algorithmic approaches (such as sorting distances by length squared which is easily computed without a call to any math function) that can eliminate the need to call sqrt() in the first place.
What the compiler does for you if you do nothing else
Many modern C compilers are willing to inline CRT functions at higher optimization levels, making the natural expression including calls to sqrt() as fast as it needs to be.
In particular, I checked MinGW gcc v3.4.5 and it replaced a call to sqrt() with inline code that shuffled the FPU state and at the core used the FSQRT instruction. Thanks to the way that the C standard interacts with IEEE 754 floating point, it did have to follow the FSQRT with some code to check for exceptional conditions and a call to the real sqrt() function from the runtime library so that floating point exceptions can be handled by the library as required by the standard.
With sqrt() inline and used in the context of a larger all double expression, the result is as efficient as possible given the constraints of of standards compliance and preservation of full precision.
For this (very common) combination of compiler and target platform and given no knowledge of the use case, this result is pretty good, and the code is clear and maintainable.
In practice, any tricks will make the code less clear, and likely less maintainable. After all, would you rather maintain (-b + sqrt(b*b - 4.*a*c)) / (2*a) or an opaque block of inline assembly and tables?
Also, in practice, you can generally count on the compiler and library authors to take good advantage of your platform's capabilities, and usually to know more than you do about the subtleties of optimizations.
However, on rare occasions, it is possible to do better.
One such occasion is in calculations where you know how much precision you really need and also know that you aren't depending on the the C standard's floating point exception handling and can get along with what the hardware platform supplies instead.
Edit: I rearranged the text a bit to put emphasis on profiling and algorithms as suggested by Jonathan Leffler in comments. Thanks, Jonathan.
Edit2: Fixed precedence typo in the quadratic example spotted by kmm's sharp eyes.
Sqrt is basically unchanged on most systems. It's a relatively slow operation, but the total system speeds have improved, so it may not be worth trying to use "tricks".
The decision to optimize it with approximations for the (minor) gains this can achieve are really up to you. Modern hardware has eliminated some of the need for these types of sacrifices (speed vs. precision), but in certain situations, this is still valuable.
I'd use profiling to determine whether this is "still useful".
If you have proven that the call to sqrt() in your code is a bottleneck with a profiler then it may be worth trying to create an optimizated version. Otherwise it's a waste of time.
This probably is the fastest method of computing the square root:
float fastsqrt(float val) {
union
{
int tmp;
float val;
} u;
u.val = val;
u.tmp -= 1<<23; /* Remove last bit so 1.0 gives 1.0 */
/* tmp is now an approximation to logbase2(val) */
u.tmp >>= 1; /* divide by 2 */
u.tmp += 1<<29; /* add 64 to exponent: (e+127)/2 =(e/2)+63, */
/* that represents (e/2)-64 but we want e/2 */
return u.val;
}
wikipedia article
This probably is the fastest method of computing the inverse square root. Assume at most 0.00175228 error.
float InvSqrt (float x)
{
float xhalf = 0.5f*x;
int i = *(int*)&x;
i = 0x5f3759df - (i>>1);
x = *(float*)&i;
return x*(1.5f - xhalf*x*x);
}
This is (very roughly) about 4 times faster than (float)(1.0/sqrt(x))
wikipedia article
It is generally safe to assume that the standard library developers are quite clever, and have written performant code. You're unlikely to be able to match them in general.
So the question becomes, do you know something that'll let you do a better job? I'm not asking about special algorithms for computing the square root (the standard library developers knows of these too, and if they were worthwhile in general, they'd have used them already), but do you have any specific information about your use case, that changes the situation?
Do you only need limited precision? If so, you can speed it up compared to the standard library version, which has to be accurate.
Or do you know that your application will always run on a specific type of CPU? Then you can look at how efficient that CPU's sqrt instruction is, and see if there are better alternatives. Of course, the downside to this is that if I run your app on another CPU, your code might turn out slower than the standard sqrt().
Can you make assumptions in your code, that the standard library developers couldn't?
You're unlikely to be able to come up with a better solution to the problem "implement an efficient replacement for the standard library sqrt".
But you might be able to come up with a solution to the problem "implement an efficient square root function for this specific situation".
Why not? You probably learn a lot!
I find it very hard to believe that the sqrt function is your application's bottleneck because of the way modern computers are designed. Assuming this isn't a question in reference to some crazy low end processor, you take a tremendous speed hit to access memory outside of your CPU caches, so unless you're algorithm is doing math on a very few numbers (enough that they all basically fit within the L1 and L2 caches) you're not going to notice any speed up from optimizing any of your arithmetic.
I still find it useful even now, though this is the context of normalizing a million+ vectors every frame in response to deforming meshes.
That said, I'm generally not creating my own optimizations but relying on a crude approximation of inverse square root provided as a SIMD instruction: rsqrtps. That is still really useful in speeding up some real-world cases if you're willing to sacrifice precision for speed. Using rsqrtps can actually reduce the entirety of the operation which includes deforming and normalizing vertex normals to almost half the time, but at the cost of the precision of the results (that said, in ways that can barely be noticed by the human eye).
I've also still found the fast inverse sqrt as often credited incorrectly to John Carmack to still improve performance in scalar cases, though I don't use it much nowadays. It's generally natural to get some speed boost if you're willing to sacrifice accuracy. That said, I wouldn't even attempt to beat C's sqrt if you aren't trying to sacrifice precision for speed.
You generally have to sacrifice the generality of the solution (like its precision) if you want to beat standard implementations, and that tends to apply whether it's a mathematical function or, say, malloc. I can easily beat malloc with a narrowly-applicable free list lacking thread-safety that's suitable for very specific contexts. It's another thing to beat it with a general-purpose allocator which can allocate variable-sized chunks of memory and free any one of them at any given time.