Loop vectorization and how to avoid it

Loop vectorization is when all right-hand-side expressions are computed at the onset. I just discovered my loops are being vectorized (in FORTRAN 77... don't ask). I need my loop condition variable to be updated in each iteration, but how can I rewrite to work around this vectorization?
In a related post, I'm looking for a way to disable this optimization "feature" in FORTRAN specifically, but here I am looking for a more algorithmic solution to the general case.

That's not what loop vectorisation means to me. To me the phrase means that the compiler will generate code which can take advantage of any vector computation capabilities of the hardware. On a simple Intel Xeon this might mean generating SSE4 instructions to manipulate a few adjacent array elements simultaneously; on a Cray there may be much more available in terms of simultaneous execution of the same operation on vector registers.
Why do you think that all the RHS expressions are 'computed at the onset'? I'm not sure what you mean by that. Could you post some code to explain? If you mean that the number of trips through the loop is computed on entry to the first iteration, then that is correct. That is a very useful feature when it comes to optimising code, and not one that most Fortran programs would benefit from avoiding.
If you are writing DO loops in Fortran, updating the iteration variable inside the loop is forbidden by the standard, and always has been so far as I recall. Your compiler might let you get away with it, but I wouldn't trust a Fortran program in which this happened.

Related

Does gcc always do this kind of optimization? (common subexpression elimination)

As an example, assume that the expression sys->pot.atoms[item->P.kind].mass is evaluated inside a loop. The loop only changes item, so the expression could be simplified to atoms[item->P.kind].mass by defining a variable atoms = sys->pot.atoms before the loop. Do modern compilers like gcc perform this kind of optimization automatically (if optimization is enabled)? And is it reliable regardless of how many expressions like atoms[item->P.kind].mass exist inside the loop?
Yes, it is a very common optimisation called loop-invariant code motion, also known as hoisting or scalar promotion, and it is often performed as a side effect of common subexpression elimination.
It is valid to compute sys->pot.atoms just once before the loop if the compiler can ascertain that neither sys nor sys->pot.atoms can be modified inside the loop.
Note however, as commented by Groo, that if sys, sys->pot, or sys->pot.atoms is declared volatile, then it would be incorrect to compute sys->pot.atoms just once when the expression is evaluated multiple times in the loop body.
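For illustration, here is a minimal sketch of what this hoisting amounts to when done by hand. Only the expression itself comes from the question; the struct layout and the function are made up.

```c
/* Hypothetical types, invented to match the expression in the question. */
struct atom   { double mass; };
struct P_t    { int kind; };
struct item_t { struct P_t P; };
struct pot_t  { struct atom *atoms; };
struct sys_t  { struct pot_t pot; };

double total_mass(const struct sys_t *sys, const struct item_t *items, int n)
{
    /* Manual loop-invariant code motion: sys->pot.atoms is loaded once,
       because only the item changes inside the loop. */
    const struct atom *atoms = sys->pot.atoms;
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += atoms[items[i].P.kind].mass;
    return sum;
}
```

An optimizing compiler will usually do this transformation itself when it can prove that neither sys nor sys->pot.atoms changes inside the loop.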
It's a very common optimization.
And is it reliable regardless of the number of expressions
No, because optimization is not something you can rely on happening in general. The C standard says nothing about it, so it's up to the maker of the compiler whether to give such guarantees, and that is not normally done for the optimizer. The optimizer has a "best effort" approach, and a missed optimization is often treated as a flaw rather than an actual bug.
EDIT:
From the discussion in the comments, I found it useful to mention that just because a certain optimization was performed, that does not guarantee faster code. For instance, the benefit of loop unrolling is that the loop condition does not need to be tested on every iteration. On the other hand, longer code can be less cache friendly. So asking whether a certain optimization is guaranteed to be performed does not really give any useful information.
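As a rough sketch of the unrolling trade-off just mentioned (hand-unrolled here purely for illustration; a compiler would normally do this itself):

```c
/* Plain loop: the i < n test runs on every iteration. */
float sum_simple(const float *a, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Unrolled by 4: the test runs a quarter as often, but the code is longer
   (and therefore potentially less cache friendly). */
float sum_unrolled(const float *a, int n)
{
    float s = 0.0f;
    int i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)      /* remainder iterations */
        s += a[i];
    return s;
}
```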
I always wonder where I should do optimization myself, and where I should sit back, relax, and leave it to the compiler.
That's very hard to know in advance. Guys like Linus Torvalds can basically see the assembly code in their head just by looking at the C code, but for us mere mortals, it comes down to benchmarking and profiling.
Before even considering micro optimizations, perform these checks
Make sure that the code you're about to optimize actually is a bottleneck
Make sure you're using a good algorithm
Make sure the code is cache friendly

Does Intel MKL or some similar library provide a vectorized way to count the number of elements in an array fulfilling some condition in C?

The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up) and using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem - I have a parameter that sets maxima and minima for subsets of a set of coefficients (tens of millions of them). Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use V?Fmax and V?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the-box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using c++, count_if from the algorithms library with an execution policy of par_unseq may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
It's not likely to be as easy in c. Because c doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count()-function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread parallelized) scalar code. OTOH, if you use for example gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use c++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
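A minimal sketch of such a counting loop, assuming per-element lower and upper bound arrays as the question describes (the function and parameter names are illustrative):

```c
#include <stddef.h>

/* Count how many elements fall outside their per-element [lo, hi] bounds.
   Note the bitwise |, not the short-circuiting || (see the note below). */
size_t count_clipped(const float *x, const float *lo, const float *hi, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i)
        count += (x[i] < lo[i]) | (x[i] > hi[i]);
    return count;
}
```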
For example, gcc vectorizes the code on x86 if at least SSE4 instructions are available (-msse4). With AVX[2/512] (-mavx / -mavx2 / -mavx512f) you get wider vectors that process more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).
Note that in the provided code, the conditions should not use the short-circuiting OR operator (||), because then the read from the max vector is semantically forbidden whenever the comparison with the min vector is already true for the current element, which severely hinders vectorization (though AVX-512 could potentially vectorize this, with a somewhat catastrophic slowdown).
I'm pretty sure gcc is not nearly optimal in the code it generates for AVX-512, since it could OR the k-regs (mask registers) directly with kor[b/w/d/q], but maybe somebody with more experience in AVX-512 (*cough* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you may want to check another performance library - IPP - which contains a set of threshold functions that could be useful in your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html

Compiler Hints and Semantics for Optimizations [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
I spent the last couple of weeks optimizing a numerical algorithm. Through a combination of precomputation, memory alignment, compiler hints and flags, and trial and error experimentation, I brought the run-time down by over an order of magnitude. I have not yet explicitly vectorized using intrinsics or used multi-threading.
Frequently when working on this type of problem, there is an initialization routine, after which many parameters become constant. These might be filter lengths, the controlling expression of a switch statement, a for-loop length, or an iteration increment. If the parameters were known at compile time, the compiler should be able to do a much more effective job of optimization by knowing exactly how to unroll loops, replace index calculations with instructions that have the offset encoded in the instruction, simplify or eliminate expressions at compile time, possibly eliminate switch statements, etc. The most extreme way of dealing with this problem would be to run the initialization routine (at run time), then run the compiler on the critical function to be optimized using some kind of plugin that allows iteration over the abstract syntax tree, replace the parameters with constants, and finally dynamically link to the resulting shared object. If the routine is short, it could be dynamically compiled inside the binary using a number of tools.
More practically, I rely very heavily on alignment, gcc's __builtin_assume_aligned, restrict, manual loop unrolling, and compiler flags to get the compiler to do what I want given the unknown values of parameters at compile time. I'm wondering what other options are available to me that are at least close to portable. I only use intrinsics as a last resort since they are not portable and a lot of work. Specifically, how can I provide the compiler (gcc) with additional information concerning loop variables, using either language semantics, compiler extensions, or external tools, so it can do a better job of optimizing for me? Similarly, is there any way to qualify variables as having a stride so that loads and stores are always aligned, thus more easily enabling auto-vectorization and loop unrolling?
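To make this concrete, a minimal sketch of the kind of hints being described (the function and its parameters are invented for illustration):

```c
#include <stddef.h>

/* restrict promises no aliasing between dst and src;
   __builtin_assume_aligned (a gcc extension) promises 32-byte alignment,
   which helps the autovectorizer emit aligned AVX loads and stores. */
void scale(float *restrict dst, const float *restrict src, float k, size_t n)
{
    float *d = __builtin_assume_aligned(dst, 32);
    const float *s = __builtin_assume_aligned(src, 32);
    for (size_t i = 0; i < n; ++i)
        d[i] = k * s[i];
}
```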
These issues come up frequently, so I am hoping there is some more elegant method of solving them. What follows are examples of the kind of problems I hand optimize but I believe the compiler ought to be able to do for me. These are not intended to be further questions.
Sometimes you have a filter, the length of which is not a multiple of the length of the longest SIMD register, and there may be memory alignment issues as well. In this case I either (A) unroll the loop by a multiple of the vector register length and call into unoptimized code for the epilogue/prologue, or (B) pad the start or end of the filter with zeros. I've recently learned gcc and other compilers have the ability to peel loops. From the limited documentation I've been able to find, I believe the finest-grained control you have over peeling is over entire functions (rather than individual loops) using compiler directives. Further, there are some parameters you can provide, but it's mostly just an upper or lower bound on the amount of unrolling or the number of instructions produced.
In order to really know the best method of unrolling/peeling or zero padding, the compiler needs to know something about the length of the loop and/or the size of the increment. For example, it would be very helpful to know that a loop is likely to have a length greater than a million or less than 100. It would be helpful to know that the loop will always run either 32 or 34 times. In fact, since the compiler knows much more about the computer architecture than I do, it would be much better if it made all the unrolling decisions based on information I provide about the loop variables. I had a situation where I wanted the compiler to unroll a loop. I specifically gave it the #pragma GCC optimize ("unroll-loops") directive. However, what it required to work was also the statement N &= ~7, thus informing the compiler that the loop length was a multiple of 8. This is not a hint mechanism provided by the language, and it does not change the value of N (which is already a multiple of 8 in my case); it is strictly there to inform the compiler's static analysis that the trip count is a multiple of the AVX register width. In this case I was lucky and it worked because gcc is very clever. But in other cases, my hints don't seem to work (or they do, but there is no compiler feedback to let me know whether the additional information was of any value). In one case I had to explicitly tell the compiler not to unroll the loop because the outer loop was very short and the overhead was not worth it. With the optimizer on the maximum setting, often the only way to know what's going on is to look at the assembly listing, make some changes, and try again.
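Roughly, the combination just described looks like the following simplified sketch (not the actual kernel; names are invented):

```c
/* The pragma asks gcc to unroll loops in the functions that follow, and
   N &= ~7 tells the optimizer the trip count is a multiple of 8 (N already
   is one in this scenario, so its value does not actually change). */
#pragma GCC optimize ("unroll-loops")
void accumulate(float *restrict acc, const float *restrict x, int N)
{
    N &= ~7;
    for (int i = 0; i < N; ++i)
        acc[i] += x[i];
}
```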
In another situation I carefully unrolled a loop so the compiler would use the AVX registers. The manual unrolling was probably necessary as the compiler doesn't have sufficient information about the length of the loop or that its length was a particular multiple. Unfortunately, the inner loop was accessing an unaligned array of floats of length four per group (16-byte alignment). The compiler was using only the legacy 128-bit XMM registers. After making a weak attempt to vectorize using AVX intrinsics, I discovered the extra overhead of the unaligned access made the performance no better than what gcc was doing. So I thought I could align each group of floats on the start of a cache line and use a stride equal to the cache-line length (or half of it, which is the length of an AVX register) to eliminate the alignment problem. However, this may turn out to be ineffective due to the extra memory bandwidth. It's certainly more work on my part. It makes the code harder to understand. And, at the very least, getting the stride right would depend on compile-time constants I would need to supply. I wonder if there is some simpler way to do this relying on the compiler to do all the work? I would be willing to try it if it meant only changing a line or two of code. It's not worth it if I have to do it manually (in this case anyway). (Thinking about it as I write this, I may be able to use a union or struct with 48 bytes of padding and a few extra lines of code. I would have to give that some thought...)
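That last idea might look roughly like the following (a sketch only; a 64-byte cache line is assumed, and the array of groups still has to be allocated on a cache-line boundary, e.g. with aligned_alloc):

```c
/* Each group of four floats starts on its own cache line; the remaining
   48 bytes are explicit padding, as suggested above. */
struct group {
    _Alignas(64) float v[4];  /* 16 bytes of data, cache-line aligned */
    char pad[48];             /* pad the struct out to a full cache line */
};
_Static_assert(sizeof(struct group) == 64, "one group per cache line");
```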

Any way to vectorize in C

My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that if I want to add two vectors, or multiply matrices, or do any other matrix computation, I should use a for loop.
I would appreciate it if you could let me know whether or not there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command and adding those elements to another vector using one command.
Thanks
MATLAB suggests you avoid for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized and they work directly on the underlying data instead of working at the MATLAB language level - a sort of opaque implementation, I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking for is not directly possible: you will need to use loops to work on vectors and matrices. In practice, you would look for a library that lets you do most of the work without writing the for loops yourself, by using already-defined functions that wrap them.
As mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the one produced by C. If you compile your C code with -O3, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls, and uses the cache well, it will be really fast.
But I think what you are looking for are libraries - search Google for LAPACK or BLAS; they might be what you need.
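For example, adding one vector to another without writing the loop yourself can be done with the BLAS axpy routine. A small sketch using the CBLAS interface (you need to link against a BLAS implementation such as OpenBLAS or MKL):

```c
#include <stdio.h>
#include <cblas.h>   /* link with a BLAS implementation, e.g. -lopenblas */

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {10.0, 20.0, 30.0, 40.0};

    /* y = 1.0 * x + y, i.e. element-wise vector addition, no explicit loop */
    cblas_daxpy(4, 1.0, x, 1, y, 1);

    for (int i = 0; i < 4; ++i)
        printf("%g\n", y[i]);   /* prints 11 22 33 44 */
    return 0;
}
```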
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of operations, but in the end you will always be using for loops to process your data.
As for speed, C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside, you will notice that things that MATLAB makes trivial (svd, cholesky, inv, cond, imread, etc.) are challenging in C.

Fortran forall restrictions

I tried to use forall to allocate dynamic arrays, but gfortran didn't like that. I also found out that write statements are forbidden in a forall block, and I suspect read statements are too.
What other functions/operations are not permitted in a forall block?
Exactly what is this construct for, besides sometimes replacing do loops when order doesn't matter? I thought it would make coding more legible and elegant, especially by showing when the order of operations is not important, but it seems quite restrictive in what operations can be done inside a forall.
What are the reasons for these restrictions, i.e. what do they protect/prevent the user from messing up? Is it a good idea to use forall? If so, for what purposes?
Right now in the code I'm working on there is only one forall block, and if I translated it all out in do loops it would give four nested loops. Which way is better?
There is not much need for FORALL and WHERE constructs nowadays. They were introduced as part of Fortran 95 (a minor extension to Fortran 90), mostly for the purpose of optimization, when code vectorization was a major thing in HPC. The reason that FORALL is so limited in application is exactly because it was designed for loop optimization. Also note that FORALL is not a looping construct but an assignment construct; thus, only assignment statements are allowed inside the block. In theory, DO loops give explicit instructions about the order of indices that the processor is going to loop over. A FORALL construct allows the compiler to choose the most optimal order based on how the array is stored in memory. However, this has lost meaning over time, since modern compilers are very good at DO loop vectorization and you are not likely to notice any improvement by using FORALL.
See a nice discussion on FORALL and WHERE here
If you are worried about code performance, you may rather want to consider a different compiler - PGI or ifort. From my own experience, gfortran is suitable for development, but not really for HPC. You will notice up to several times faster execution with code compiled with pgf90 or ifort.
The FORALL construct proved to be too restrictive and is mostly useful only for array operations. For its exact limitations see IBM Fortran - FORALL. Less restrictive is the DO CONCURRENT construct of Fortran 2008; even READ and WRITE statements are allowed there. See Intel Fortran - DO CONCURRENT and New features of Fortran 2008.
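As a small sketch of the difference (illustrative only; the arrays and values are made up):

```fortran
program forall_demo
   implicit none
   integer :: i
   real :: a(5), b(5), c(5)
   b = 1.0
   c = 2.0

   ! FORALL is an assignment construct: only assignments (and nested
   ! FORALL/WHERE) may appear inside the block.
   forall (i = 1:5)
      a(i) = b(i) + c(i)
   end forall

   ! DO CONCURRENT (Fortran 2008) is a loop whose iterations may run in any
   ! order; ordinary statements such as WRITE are allowed inside.
   do concurrent (i = 1:5)
      a(i) = b(i) + c(i)
      write (*, *) a(i)
   end do
end program forall_demo
```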

Resources