Assume that you have chosen the most efficient algorithm for solving a problem where performance is the first priority, and now that you're implementing it you have to decide about details like this:
v[i*3+0], v[i*3+1] and v[i*3+2] contain the components of the velocity of particle i and we want to calculate the total kinetic energy. Given that all particles are of the same mass, one may write:
inline double sqr(double x)
{
return x*x;
}
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
sum += sqr(v[i*3+0]) + sqr(v[i*3+1]) + sqr(v[i*3+2]);
return 0.5 * mass * sum;
}
To reduce the number of multiplications, it can be written as:
double get_kinetic_energy(double v[], int n)
{
double sum = 0.0;
for (int i=0; i < n; i++)
{
double *w = v + i*3;
sum += sqr(w[0]) + sqr(w[1]) + sqr(w[2]);
}
return 0.5 * mass * sum;
}
(one may write a function with even fewer multiplications, but that's not the point of this question)
Now my question is: Since many C compilers can do this kind of optimizations automatically, where should the developer rely on the compiler and where should she/he try to do some optimization manually?
where should the developer rely on the compiler and where should she/he try to do some optimization manually?
Do I have fairly in-depth knowledge of the target hardware as well as how C code translates to assembler? If no, forget about manual optimizations.
Are there any obvious bottlenecks in this code - how do I know that it needs optimization in the first place? Obvious culprits are I/O, complex loops, busy-wait loops, naive algorithms etc.
When I found this bottleneck, how exactly did I benchmark it and am I certain that the problem doesn't lie in the benchmarking method itself? Experience from SO shows that some 9 out of 10 strange performance questions can be explained by incorrect benchmarking. Including: benchmarking with compiler optimizations disabled...
From there on you can start looking at system-specific things as well as the algorithms themselves. There are far too many things to look at to cover in an SO answer. There is a huge difference between optimizing code for a low-end microcontroller and for a 64-bit desktop PC (and everything in between).
One thing that looks a bit like premature optimization, but could just be ignorance of language abilities is that you have all of the information to describe particles flattened into an array of double values.
I would suggest instead that you break this down, making your code easier to read by creating a struct to hold the three datapoints on each particle. At that point you can create functions which take a single particle or multiple particles and do computations on them.
This will be much easier for you than having to pass three times the number of particles arguments to functions, or trying to "slice" the array. If it's easier for you to reason about, you're less likely to generate warnings/errors.
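A minimal sketch of what that could look like. The names particle, particle_speed_sq and total_kinetic_energy are made up for illustration, and mass is passed as a parameter on the assumption that all particles share it, as in the question:

```c
#include <stddef.h>

/* Hypothetical type: one particle's velocity components. */
typedef struct {
    double vx, vy, vz;
} particle;

/* Squared speed of a single particle. */
double particle_speed_sq(const particle *p)
{
    return p->vx * p->vx + p->vy * p->vy + p->vz * p->vz;
}

/* Total kinetic energy of n particles that all have the same mass. */
double total_kinetic_energy(const particle *ps, size_t n, double mass)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += particle_speed_sq(&ps[i]);
    return 0.5 * mass * sum;
}
```

An array of such structs has the same memory layout as the flat array of doubles on typical implementations, so this is mainly a readability change; measure before assuming it affects speed either way.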
Looking at how both gcc and clang handle your code, the micro-optimisation you contemplate is pointless. The compilers already apply standard common subexpression elimination techniques that remove the overhead you are trying to eliminate.
As a matter of fact, the code generated handles 2 components at a time using XMM registers.
If performance is a must, then here are steps that will save the day:
the real judge is the wall clock. Write a benchmark with realistic data and measure performance until you get consistent results.
if you have a profiler, use it to determine where the bottlenecks are, if any. Changing algorithms for those parts that appear to hog performance is an effective approach.
try and get the best from the compiler: study the optimization options and try and let the compiler use more aggressive techniques if they are appropriate for the target system. For example -mavx512f -mavx512cd let gcc generate code that handles 8 components at a time using the 512-bit ZMM registers.
This is a non intrusive technique as the code does not change, so you don't risk introducing new bugs by hand optimizing the code.
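As a rough illustration of the first step, here is a small timing sketch. seconds_per_call is a hypothetical helper, and it uses the standard clock() function, which measures processor time rather than true wall-clock time; for a single-threaded, CPU-bound loop the two are usually close, but a real benchmark may want a platform-specific wall-clock timer instead.

```c
#include <time.h>

/* Kernel under test: the same computation as in the question. */
double kinetic_energy(const double v[], int n, double mass)
{
    double sum = 0.0;
    for (int i = 0; i < 3 * n; i++)
        sum += v[i] * v[i];
    return 0.5 * mass * sum;
}

/* Hypothetical helper: average processor seconds per call over `reps` calls.
   The last computed energy is stored in *result so the caller can check it. */
double seconds_per_call(const double v[], int n, double mass,
                        int reps, double *result)
{
    volatile double sink = 0.0;   /* volatile keeps the calls from being dropped */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink = kinetic_energy(v, n, mass);
    clock_t stop = clock();
    *result = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Run it several times with realistic data sizes and with the optimization flags you intend to ship, and only trust numbers that are stable from run to run.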
Optimisation is a difficult art. In my experience, simplifying the code gets better results and far fewer bugs than adding extra subtle stuff to try and improve performance at the cost of readability and correctness.
Looking at the code, an obvious simplification seems to generate the same results and might facilitate the optimizer's job (but again, let the wall clock be the judge):
double get_kinetic_energy(const double v[], int n, double mass)
{
double sum = 0.0;
for (int i = 0; i < 3 * n; i++)
sum += v[i] * v[i];
return 0.5 * mass * sum;
}
Compilers like clang and gcc are simultaneously far more capable and far less capable than a lot of people give them credit for.
They have an exceptionally wide range of patterns where they can transform code into an alternative form which is likely to be more efficient and still behave as required.
Neither, however, is especially good at judging when optimizations will be useful. Both are prone to making some "optimization" decisions that are almost comically absurd.
For example, given
void test(char *p)
{
char *e = p+5;
do
{
p[0] = p[1];
p++;
}while(p < e);
}
when targeting the Cortex-M0 at -O2 or higher, gcc 10.2.1 will generate code equivalent to calling memmove(p, p+1, 5);. While it would be theoretically possible that a library implementation of memmove might optimize the n==5 case in such a way as to outperform the five-instruction byte-based loop generated at -Og (or even -O0), it seems far more likely that any plausible implementation would spend some time analyzing what needs to be done, and then spend just as long executing the loop as the code generated using -O0 would.
What happens, in essence, is that gcc analyzes the loop, figures out what it's trying to do, and then uses its own recipe to perform that action in a manner that may or may not be any better than what the programmer was trying to do in the first place.
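To make the equivalence concrete: the loop as written runs five times and copies p[1]..p[5] down to p[0]..p[4], which is what memmove(p, p+1, 5) does. The two function names below are made up for this sketch; it only shows that the results match, not which one is faster on a given target.

```c
#include <string.h>

/* The byte-by-byte loop from the example above. */
void shift_loop(char *p)
{
    char *e = p + 5;
    do {
        p[0] = p[1];
        p++;
    } while (p < e);
}

/* What the compiler may substitute: one overlapping 5-byte move. */
void shift_memmove(char *p)
{
    memmove(p, p + 1, 5);
}
```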
I am using the following code to multiply matrices:
cblas_sgemv(CblasRowMajor, CblasNoTrans, n, n, 1, (float *)A, n, B, 1, 1.0f, C, 1);
Where A is a n x n matrix, and B is n x 1 matrix.
The alternative is to do it the usual way -
for (k = 0; k < n; k++)
for (i = 0; i < n; i++)
C[i] += A[i * n+ k] * B[k];
Surprisingly, the Blas implementation is taking more time than the for loop version. What could be the reason for that?
If we look at the documentation for that function here:
https://developer.apple.com/documentation/accelerate/1513065-cblas_sgemv
... then it's a pretty complex function able to handle different memory layouts, apply a scaling factor, optionally transpose, use an arbitrary stride across rows and columns that probably isn't subject to constant folding if calling the function requires going across module boundaries, etc.
So that's a whole lot more to do, and with the compiler having much less compile-time information to optimize against, than your simple loop version. Where I think it could have an edge in performance is if your matrices are extremely large. Then the BLAS implementation might be able to manually use SIMD in ways that beat your most aggressive optimizers, parallelize the loop, use loop tiling, etc. But those methods usually only provide improvements to justify their overhead when used against especially large matrices, at which point the extra overhead to handle all those extra parameters and the cost of the indirect function call would also be trivialized.
If n in your example is sufficiently large (say 1000+), then I would be slightly more surprised that your simple loopy version is beating it, but it still doesn't seem like a huge surprise since that's a pretty complex function that involves a lot of runtime overhead that can't be optimized away (given that it's a dylib API from what I can tell) with all the possible parameters you can specify. If the library is decent, then I suspect it will begin to beat your simple scalar code at some threshold for n, but that might require n to be quite large, especially given that our optimizers are getting better and better these days at vectorizing our scalar logic.
I'm not familiar with this library but browsing over its API, it's quite generalized in nature. Typically if you want to get the most out of SIMD, you have to organize your data in a certain way suited for your application requirements. For example, if you can represent matrices and vectors in SoA form, then I've found that I can get close to the theoretical boosts in data consumption that SIMD offers (ex: close to 4x with 128-bit registers for SPFP and 8x with 256-bit) and beat my optimizers. But in AoS form doing a single matrix/vector multiplication at a time, I find the speedups far more negligible (say 25% or less), or sometimes even slower than straightforward loops involving scalar instructions that I leave up to my optimizers to vectorize away.
What I would typically expect as far as API design in a library that offers the fastest matrix/vector multiplication is for the multiply function to input an entire container/array of vectors (multiple vectors at once, i.e., against a single matrix). That would generally give it the maximum breathing room to most effectively use parallelization as well as trivializing the overhead of the function call (ex: the branching overhead to determine if scale factors are ~1.0f to determine whether or not an additional multiplication is needed per iteration), and that should have a higher probability of performing well even if your matrices/vectors are on the smaller side provided that you pass in many small vectors at once. There might be a way to do that with your library. If so, I'd give that a shot.
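To illustrate the kind of interface I mean, here is a plain C sketch of a batched multiply. matvec_batch is a hypothetical function, not part of BLAS or Accelerate, and it is written as simple scalar loops; the point is only the shape of the API, where one call handles many vectors against the same row-major matrix.

```c
#include <stddef.h>

/* Hypothetical batched interface: for each of `count` vectors,
   Y[v] = A * X[v], where A is n x n (row-major) and the vectors
   are stored back to back, n floats each. */
void matvec_batch(const float *A, size_t n,
                  const float *X, float *Y, size_t count)
{
    for (size_t v = 0; v < count; v++) {
        const float *x = X + v * n;
        float *y = Y + v * n;
        for (size_t i = 0; i < n; i++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * x[k];   /* walks a row of A contiguously */
            y[i] = acc;
        }
    }
}
```

With this shape, any per-call setup is paid once per batch rather than once per vector, and an optimized implementation has room to reuse rows of A across vectors.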
int optimizeMe (const char* string0, const char* string1)
{
int i0;
int i1 = strlen(string1) - 1;
int count = 0;
for (i0 = 0; i0 < strlen(string0); i0++)
{
if (toupper(string0[i0]) == toupper(string1[i1]))
count++;
count++;
if ((i0%32)==0)
i1--;
}
return(count / 8);
}
I know I can optimize this code by using register, gcc -O2, reduction in strength i0%32=0x10000, and common expression count/8 = count >> 3, etc;
However, how can I optimize them by code motion? Specifically for the if statement and i1--.
Any hints are appreciated !
As Lundin pointed out in the comments, these are premature optimisations. They're premature because you're clearly just guessing, and not using a profiler to test your theories on real-world programs with real-world bottlenecks.
They're also micro-optimisations, and we haven't really needed to micro-optimise in C since the 80s, thanks to significant breakthroughs in technology that allow us to do amazing things like mocking up three-dimensional realms in real-time, for example.
gcc already has support for various features such as dead code elimination, code hoisting (even into compile-time in some cases) and profile-guided optimisation which you might be interested in. I think we take for granted the fact that we have a compiler which can statically deduce when code is unreachable, all by its lonesome; that's a quite complex optimisation for a machine to perform.
By profiling as you test, and then recompiling, feeding the profiler data back into the compiler, the compiler obtains information about how to rearrange branches to be better predicted. That's profile-guided optimisation, and it's semi-automated. I wonder what the authors of Wolfenstein 3D would've done for this kind of technology...
Speaking of profilers, if I may suggest that you test these in some realistic usecases (i.e. actual programs that have active development and a large community):
using register
reduction in strength i%32=0x10000
count/8 = count >> 3
That last optimisation isn't even correct for you (see exercise 5), by the way... This might be a good time to mention the other debugging tools we have in our suite. I can highly recommend checking out ASan (and the other sanitizers, such as UBSan), for example; they will likely save you hours of debugging one day.
It might be best to use size_t since that's what strlen returns for a start, size_t is more portable for use with strlen and quite possibly faster too (due to the fact that size_t has no sign bit and so no potential sign handling overhead when you write things like for (size_t x = 0; x < y; x++))...
... or, if you want to provide architecture-specific hints to your compiler (which presumably has no profile-guided optimisation, or else you wouldn't need to manually do that), you could also use uint_fast32_t or something else that isn't really suitable for the task, but is still vastly more suitable than int.
I gather you must be getting input from somewhere, or else your program is "pure", in the (functional) sense that it has no side-effects that change the way the user interacts with the system at all (in that case, your compiler might even hoist all of your logic into compile-time evaluation)... Have you considered adjusting the buffer sizes of whichever files and/or streams (i.e. stdin and stdout) you're using? You ought to be able to use setvbuf to do that... If you have many streams open at once, you might want to try choosing a smaller stream buffer size in order to keep all of your stuff in cache. I like to start off with a buffer size of 1, and work my way up from there, that way you'll see precisely where the bottleneck for your system is... To be clear, if you were to unroll loops (which gcc will happily do automatically if it's beneficial)...
If you're using a really primitive compiler (which you're not, though profilers are honestly likely to guide you straight to the heftiest optimisations in any case, optimising your time as a developer), you might be able to suggest to the compiler to emit non-branching code for these lines:
// consider `count += (toupper((unsigned char) string0[i0]) == toupper((unsigned char) string1[i1]));`?
if (toupper(string0[i0]) == toupper(string1[i1]))
count++;
The casts are necessary to prevent crashes in certain circumstances, by the way... you absolutely need to make sure the only values you pass to a <ctype.h> function are unsigned char values, or EOF.
// consider using `i1 -= !(i0 % 32);`?
if ((i0%32)==0)
i1--;
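Putting the suggestions above together, a sketch of the loop could look like the following. This is not a drop-in replacement for the question's function: count_matches is a made-up name, it returns the raw number of matching characters instead of count / 8 so the result is easy to check, and the guard that stops i1 from going below zero is my own addition, because the original would otherwise index before the start of string1.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Sketch: strlen() is hoisted out of the loop condition, indices are
   size_t, and every char is cast to unsigned char before toupper(). */
int count_matches(const char *string0, const char *string1)
{
    const size_t len0 = strlen(string0);   /* loop-invariant, computed once */
    const size_t len1 = strlen(string1);
    if (len1 == 0)
        return 0;                          /* nothing to compare against */

    size_t i1 = len1 - 1;
    int count = 0;
    for (size_t i0 = 0; i0 < len0; i0++) {
        /* the comparison yields 0 or 1, so no branch is needed */
        count += toupper((unsigned char)string0[i0]) ==
                 toupper((unsigned char)string1[i1]);
        if (i0 % 32 == 0 && i1 > 0)        /* assumption: clamp at index 0 */
            i1--;
    }
    return count;
}
```

Whether this is actually faster than what gcc produces from the original at -O2 is something only a profiler can tell you.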
I'm trying to optimize some C code, and it's my first time.
As a first step I dumped my executable file in order to see the assembler code.
For example for this function:
void init_twiddle(int N)
{
int i=0;
for(i=0; i<ELEMENTS_HALF; i++)
{
twiddle_table[i].re = (float) cos((float)i * 2.0 * PI / (float)N);
twiddle_table[i].im = (float) - sin((float)i * 2.0 * PI / (float)N);
}
}
Wouldn't it be better if I did this instead:
void init_twiddle(int N)
{
int i=0;
float puls = 2.0 * PI / (float)N;
for(i=0; i<ELEMENTS_HALF; i++)
{
twiddle_table[i].re = (float) cos((float)i * puls);
twiddle_table[i].im = (float) - sin((float)i * puls);
}
}
in order to avoid the mult and div operations being repeated thousands of times?
Unfortunately, your first step was already kind of wrong.
Don't blindly walk through your code optimizing arbitrary loops which might or (more probably) might not affect performance (maybe because that code is so rarely called that it doesn't really use any run-time).
Optimizing means: you first need to find out where the time is spent in your program. Use timing measurements to narrow down where your program spends most of its time (you can use homegrown logging with microsecond timers or a profiling application for that). Without at least some figures you wouldn't even see where the compiler has already helped you and maybe has already maxed out all possibilities, even if your code looks like it has some potential left for being faster (modern compilers are really very good at that).
Only if you know the hot spots in your application you should start optimizing those.
The problem is that it is a floating point expression and floating point operations are not associative. (float)i * 2.0 * PI / (float)N is evaluated left to right, so hoisting 2.0 * PI / (float)N regroups the operations and can change the rounding. So the optimization is invalid in general for any compiler that follows IEEE 754. So either you have to do this optimization manually, or you have to tell the compiler to treat floating point as associative for optimization purposes (in gcc and clang you use -ffast-math to do this). This will introduce slight changes in the resulting values.
For comparison of the assembly:
Without -ffast-math
With -ffast-math
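A tiny example of why regrouping is not allowed by default. The two helper functions are made up for this sketch; with IEEE 754 doubles and default compiler settings they give different answers for the same three inputs:

```c
/* Same three operands, grouped differently. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

double sum_right(double a, double b, double c)
{
    return a + (b + c);
}
```

For 0.1, 0.2 and 0.3 the first grouping rounds to 0.6000000000000001 and the second to 0.6. A compiler that must preserve the written result cannot swap one for the other unless you allow it to, which is what -ffast-math does.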
You can do this much faster; indeed you need only 1 sine and 1 cosine (which are disastrously slow). What you're actually doing is calculating the coordinates of a little vector that you spin around the origin; the alternative way to do it is by actually spinning that vector around the origin, one step at a time. The rotation matrix for a single step is what costs the single sine and cosine.
Of course this may be a bit less accurate, but no significant trouble should build up if you make a reasonable number of steps.
Premature optimization is the root of all evil
– Donald Knuth
You should optimize, if you have a problem with execution duration. There are tools that record the duration of every single statement or at least function call.
I think that most compilers detect such constant expressions in a loop and there is nothing to optimize, because it is already optimized.
First of all, use double, not float. cos and sin take and return double (the float versions are cosf and sinf), so you're just doing a lot of converting.
Second, calculate the angle once and put it in a variable, not twice.
Maybe the compiler recognizes that it can do this for you, but I prefer not to tempt the compiler not to.
Third, is there a sincos function? The sine and cosine functions are closely related, so one can be calculated at the same time as the other.
Fourth, when thinking about performance, switch your brain to thinking in percent of total time, not doing something "thousands of times". That way, you will concentrate on what has the greatest overall benefit, not things that might well be irrelevant.
This probably won't change your code's performance, since this is a standard loop-invariant optimization that is performed by any standard compiler (assuming optimizations aren't turned off).
Optimization in modern compilers is getting better and better, from basic optimizations like constant folding to utilizing SIMD instructions. However, I wonder how far these kinds of optimizations should be taken and how this decision is made by compilers nowadays.
Let's look at an example:
#include <stdio.h>
static double naive_sin(double n) {
return n - n*n*n / 6.0 + n*n*n*n*n / 120.0 - n*n*n*n*n*n*n / 5040.0;
}
int main() {
printf("%f\n", naive_sin(1.0));
return 0;
}
When compiling this with GCC with -O3, it can be observed that the resulting floating point number is calculated by the compiler and embedded in the generated machine code. Optimizing further than that is obviously not possible.
Now, let's look at a second example:
#include <stdio.h>
int main() {
double start = 0.0;
for (int i = 0; i < 100; i++) {
start += 1.0;
}
printf("%f\n", start);
return 0;
}
With the result of the first example in mind, one could expect the compiler to apply similar optimization and produce the constant 100.0 in the resulting machine code. However, when looking at the output, it turns out that the loop is still there!
Obviously this kind of optimization is not always possible. Let's say you were writing a program that calculates pi to a million places. Such a program requires no user input, so theoretically the result could be hardcoded into the machine code by the compiler. Of course this is not a good idea, because the compiler will take much longer to internally evaluate a program like that as opposed to just running the less optimized version.
Still, what makes the compiler decide to not optimize the loop in this case? Are there languages/compilers that optimize this kind of code or is there something preventing this? Is it perhaps related to the concept of not being able to predict if a program is ever going to end?
It's really just a question of which optimizations are enabled, and what optimizations are actually available in your compiler. Some optimizations, like function inlining and constant propagation, are basically universally available and relatively easy to implement. So, the majority of compilers will optimize the first program with most optimization settings.
The second program requires loop analysis and loop elimination to optimize, which is much trickier to do. A compiler probably could optimize the second program, but your compiler most likely doesn't have the mechanisms for optimizing such a loop (proving the correctness of float optimizations is often a lot trickier than proving the correctness of integer optimizations). Note that my version of GCC does optimize the loop if start is declared as an int.
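If you want to check this on your own compiler, one option is to put both variants side by side and compare the assembly (for example with gcc -O3 -S). The function names below are made up for this sketch, and whether either loop is folded to a constant depends on the compiler and version, which I have not verified here:

```c
/* Integer accumulator: a candidate for being folded to `return 100;`. */
int sum_int(void)
{
    int total = 0;
    for (int i = 0; i < 100; i++)
        total += 1;
    return total;
}

/* Floating point accumulator: same shape, but the compiler has to prove
   that every intermediate sum is exact before it can fold the loop. */
double sum_double(void)
{
    double total = 0.0;
    for (int i = 0; i < 100; i++)
        total += 1.0;
    return total;
}
```

Both functions return the same value either way; only the generated code differs.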
Is this:
int x=0;
for (int i=0;i<100;i++)
x++;
for (int i=0;i<100;i++)
x--;
for (int i=0;i<100;i++)
x++;
return x;
Same as this:
int x=0;
for (int i=0;i<100;i++){
x++;
x--;
x++;
}
return x;
Note: This is just an example, the real loop would be much more complex.
So are these two loops the same or is the second one faster?
EDIT: Java or C++. I was wondering about both.
I didn't know that compiler would actually optimize the code.
Unoptimized: three loops take longer, since there are three sets of loop opcodes.
Optimized, it depends on the optimizer. A good optimizer might be smart enough to realize that the x++;x--; statements in the single-loop version cancel each other out, and eliminate them. A really smart optimizer might be able to do the same thing with the separate loops. A ridiculously smart optimizer might figure out what the code is doing, and just replace the whole block with return 100; (see added note below)
But the real-world answer for optimization is usually: fuhgeddaboutit. If your code gets its job done correctly, and fast enough to be useful, leave it alone. Only if actual tests show it's too slow should you profile to identify the bottlenecks and replace them with more efficient code. (Or a better algorithm entirely.)
Programmers are expensive, CPU cycles are cheap, and there are plenty of other tasks with bigger payoffs. And more fun to write, too.
Added note, about the "ridiculously smart optimizer" bit: the D language offers Compile-Time Function Evaluation. CTFE allows you to use virtually the full capability of the language to compute something at build time, then insert only the computed answer into the runtime code. In other words, you can explicitly turn the entire compiler into your optimizer for selected chunks of code.
If you count each increment, decrement, assignment and comparison as one operation, your first example has some 900 operations, while your second example has ~500. That is, if the code is executed as is and not optimized. It should be obvious which is more performant.
In reality the code may or may not be optimized by a compiler, and different compilers for different languages will do quite a different job at optimization.