Fortran forall restrictions - loops

I tried to use forall to allocate dynamic arrays, but gfortran didn't like that. I also found out that write statements are forbidden in a forall block, and I suspect read statements are too.
What other functions/operations are not permitted in a forall block?
Exactly what is this construct for, besides sometimes replacing do loops when order doesn't matter? I thought it would make coding more legible and elegant, especially showing when the order of operations is not important, but it seems quite restrictive about which operations can be done inside a forall.
What are the reasons for these restrictions, i.e. what do they protect/prevent the user from messing up? Is it a good idea to use forall? If so, for what purposes?
Right now in the code I'm working on there is only one forall block, and if I translated it all out in do loops it would give four nested loops. Which way is better?

There is not much need for the FORALL and WHERE constructs nowadays. FORALL was introduced in Fortran 95, a minor revision of Fortran 90 (WHERE was already part of Fortran 90), mostly for the purpose of optimization, when code vectorization was a major thing in HPC. The reason that FORALL is so limited in application is exactly that it was designed for loop optimization. Also note that FORALL is not a looping construct but an array assignment. Thus, only assignment statements (including pointer assignments and nested WHERE or FORALL) are allowed inside the block. In theory, a DO loop gives explicit instructions about the order in which the processor loops over the indices, whereas a FORALL construct lets the compiler choose the best order based on how the array is stored in memory. However, this has lost meaning over time, since modern compilers are very good at vectorizing DO loops and you are not likely to notice any improvement by using FORALL.
See a nice discussion on FORALL and WHERE here
If you are worried about code performance, you may rather want to consider a different compiler - PGI or ifort. From my own experience, gfortran is suitable for development, but not really for HPC. You will notice up to several times faster execution with code compiled with pgf90 or ifort.

The FORALL construct proved to be too restrictive and is mostly useful only for array operations. For the exact limitations see IBM Fortran - FORALL. Less restrictive is the DO CONCURRENT construct of Fortran 2008. Even read and write statements are allowed there. See Intel Fortran - DO CONCURRENT and New features of Fortran 2008.

Does gcc always do this kind of optimization? (common subexpression elimination)

As an example, assume that the expression sys->pot.atoms[item->P.kind].mass is evaluated inside a loop. The loop only changes item, so the expression can be simplified as atoms[item->P.kind].mass by defining a variable as atoms = sys->pot.atoms before the loop. Do modern compilers like gcc perform this kind of optimization automatically (if optimization is enabled)? And is it reliable regardless of the number of expressions like atoms[item->P.kind].mass existing inside a loop?
Yes, it is a very common optimisation called loop-invariant code motion, also called hoisting or scalar promotion, often performed as a side effect of common subexpression elimination.
It is valid to compute sys->pot.atoms just once before the loop if the compiler can ascertain that neither sys nor sys->pot.atoms can be modified inside the loop.
Note however, as commented by Groo, that if sys or sys->pot or sys->pot.atoms are specified as volatile, then it would be incorrect to compute it only once if the expression sys->pot.atoms is evaluated multiple times in the loop body or expressions.
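To make the transformation concrete, here is a minimal sketch of the loop before and after hand-hoisting. The struct definitions and function names are hypothetical, invented only to mirror the expression in the question; the real types are not shown there. With optimization enabled, a compiler that can prove sys->pot.atoms is not modified in the loop will typically turn the first form into the second by itself.

```c
#include <stddef.h>

/* Hypothetical types, named only to mirror the expression in the question. */
struct atom       { double mass; };
struct potential  { struct atom *atoms; };
struct sim_system { struct potential pot; };
struct particle   { int kind; };
struct item       { struct particle P; };

/* As in the question: the full expression is spelled out in every iteration. */
double total_mass_naive(const struct sim_system *sys,
                        const struct item *items, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += sys->pot.atoms[items[i].P.kind].mass;
    return total;
}

/* Hand-hoisted: the loop-invariant load of sys->pot.atoms is done once. */
double total_mass_hoisted(const struct sim_system *sys,
                          const struct item *items, size_t n)
{
    const struct atom *atoms = sys->pot.atoms;  /* not modified inside the loop */
    double total = 0.0;
    for (size_t i = 0; i < n; ++i)
        total += atoms[items[i].P.kind].mass;
    return total;
}
```

Both functions return the same value; the only difference is whether the programmer or the optimizer moves the invariant load out of the loop.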
It's a very common optimization.
"And is it reliable regardless of the number of expressions"
No, because optimizations are not something you can rely on happening in general. The C standard says nothing about them, so it is up to the compiler vendor whether to give guarantees, and in practice vendors rarely do that for the optimizer. The optimizer takes a "best effort" approach, and a missed optimization is usually treated as a flaw rather than an actual bug.
EDIT:
From discussion in comments, I found it useful to mention that just because a certain optimization was performed, that does not guarantee faster code. For instance, the benefit of loop unrolling is that the test in the loop does not need to be performed every iteration. But on the other hand, longer code can be less cache friendly. So asking if it's guaranteed that a certain optimization is performed or not does not really give any useful information.
"I always wonder where I should do optimization myself, and where I should sit back, relax and leave it to the compiler."
That's very hard to know in advance. Guys like Linus Torvalds can basically see the assembly code in their head just by watching the C code, but for us mere mortals, it comes down to benchmarking and profiling.
Before even considering micro-optimizations, perform these checks:
Make sure that the code you're about to optimize actually is a bottleneck
Make sure you're using a good algorithm
Make sure the code is cache friendly

Does Intel MKL or some similar library provide a vectorized way to count the number of elements in an array fulfilling some condition in C?

The problem
I'm working on implementing and refining an optimization algorithm with some fairly large arrays (from tens of millions of floats and up) and using mainly Intel MKL in C (not C++, at least not so far) to squeeze out every possible bit of performance. Now I've run into a silly problem - I have a parameter that sets maxima and minima for subsets of a set of (tens of millions) of coefficients. Actually applying these maxima and minima using MKL functions is easy - I can create equally-sized vectors with the limits for every element and use v?Fmax and v?Fmin to apply them. But I also need to account for this clipping in my error metric, which requires me to count the number of elements that fall outside these constraints.
However, I can't find an MKL function that allows me to do things like counting the number of elements that fulfill some condition, the way you can create and sum logical arrays with e.g. NumPy in Python or in MATLAB. Irritatingly, when I try to google this question, I only get answers relating to Python and R.
Obviously I can just write a loop that increments a counter for each element that fulfills one of the conditions, but if there is an already optimized implementation that allows me to achieve this, I would much prefer that just owing to the size of my arrays.
Does anyone know of a clever way to achieve this robustly and very efficiently using Intel MKL (maybe with the statistics toolbox or some creative use of elementary functions?), a similarly optimized library that does this, or a highly optimized way to hand-code this? I've been racking my brain trying to come up with some out-of-the-box method, but I'm coming up empty.
Note that it's necessary for me to be able to do this in C, that it's not viable for me to shift this task to my Python frontend, and that it is indeed necessary for me to code this particular subprogram in C in the first place.
Thanks!
If you were using C++, std::count_if from the <algorithm> header with the std::execution::par_unseq execution policy may parallelize and vectorize the count. On Linux at least, it typically uses Intel TBB to do this.
It's not likely to be as easy in C. Because C doesn't have concepts like templates, callables or lambdas, the only way to specialize a generic (library-provided) count() function would be to pass a function pointer as a callback (like qsort() does). Unless the compiler manages to devirtualize and inline the callback, you can't vectorize at all, leaving you with (possibly thread-parallelized) scalar code. OTOH, if you use, for example, gcc vector intrinsics (my favourite!), you get vectorization but not parallelization. You could try to combine the approaches, but I'd say get over yourself and use C++.
However, if you only need vectorization, you can almost certainly just write sequential code and have the compiler autovectorize, unless the predicate for what should be counted is poorly written, or your compiler is braindamaged.
For example, gcc vectorizes the code on x86 if at least SSE4 instructions are available (-msse4). With AVX, AVX2 or AVX-512 (-mavx / -mavx2 / -mavx512f) you get wider vectors that process more elements at once. In general, if you're compiling on the same hardware you will be running the program on, I'd recommend letting gcc autodetect the optimal instruction set extensions (-march=native).
Note that in the provided code, the conditions should not be combined with the short-circuiting OR (||), because then the read from the max vector is semantically forbidden if the comparison with the min vector was already true for the current element, which severely hinders vectorization (though AVX-512 could potentially vectorize this with a somewhat catastrophic slowdown).
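The code this answer refers to is not reproduced in this copy of the thread. As a stand-in, here is a minimal sketch of the kind of sequential loop being described, assuming per-element float limits as in the question. The function name count_outside and the parameter names are hypothetical. The two comparisons are combined with bitwise | rather than ||, so both loads happen unconditionally.

```c
#include <stddef.h>

/* Counts the elements of x that lie outside [lo[i], hi[i]].
 * Hypothetical helper: names and signature are illustrative only. */
size_t count_outside(const float *x, const float *lo, const float *hi, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        /* Each comparison yields 0 or 1; bitwise | does not short-circuit,
         * so lo[i] and hi[i] are both read in every iteration. */
        count += (size_t)((x[i] < lo[i]) | (x[i] > hi[i]));
    }
    return count;
}
```

Whether a given compiler actually vectorizes this depends on its version and flags, so it is worth inspecting the generated assembly or the compiler's vectorization report rather than assuming it.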
I'm pretty sure gcc is not nearly optimal in the code it generates for AVX-512, since it could do the OR directly in the k-registers (mask registers) with kor[b/w/d/q], but maybe somebody with more experience in AVX-512 (*cough* Peter Cordes *cough*) could weigh in on that.
MKL doesn't provide such functions, but you could check another performance library, IPP, which contains a set of threshold functions that could be useful in your case. Please refer to the IPP Developer Reference for more details: https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/essential-functions/conversion-functions/threshold.html

Any way to vectorize in C

My question may seem primitive or dumb because I've just switched to C.
I have been working with MATLAB for several years and I've learned that any computation should be vectorized in MATLAB and I should avoid any for loop to get an acceptable performance.
It seems that if I want to add two vectors, or multiply matrices, or do any other matrix computation, I should use a for loop.
I would appreciate it if you could let me know whether there is any way to do the computations in a vectorized sense, e.g. reading all elements of a vector using only one command and adding those elements to another vector using one command.
Thanks
MATLAB suggests avoiding for loops because most of the operations available on vectors and matrices are already implemented in its API and ready to be used. They are probably optimized, and they work directly on the underlying data instead of at the MATLAB language level: a sort of opaque implementation, I guess.
Even MATLAB uses for loops underneath to implement most of its magic (or delegates them to highly specialized assembly instructions or through CUDA to the GPU).
What you are asking for is not directly possible: you will need to use loops to work on vectors and matrices. What you would actually look for is a library that lets you do most of the work without writing the for loops yourself, by calling predefined functions that wrap them.
As was mentioned, it is not possible to hide the for loops. However, I doubt that the code MATLAB produces is in any way faster than the code produced by a C compiler. If you compile your C code with -O3, the compiler will try to use every hardware feature your computer has available, such as SIMD extensions and multiple issue. Moreover, if your code is good, doesn't cause too many pipeline stalls and makes good use of the cache, it will be really fast.
But I think what you are looking for are some libraries; search Google for LAPACK or BLAS, they might be what you need.
In C there is no way to perform operations in a vectorized way. You can use structures and functions to abstract away the details of operations but in the end you will always be using fors to process your data.
As for speed, C is a compiled language and you will not get a performance hit from using for loops in C. C has the benefit (compared to MATLAB) that it does not hide anything from you, so you can always see where your time is being used. On the downside, you will notice that things that MATLAB makes trivial (svd, cholesky, inv, cond, imread, etc.) are challenging in C.
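As a small illustration of what the answers above describe, here is a sketch of the C counterpart of MATLAB's c = a + b: a plain loop wrapped in a function. The name vec_add is hypothetical. The restrict qualifiers are an assumption that the three arrays do not overlap, which tends to make it easier for the compiler to emit SIMD instructions when optimization is enabled.

```c
#include <stddef.h>

/* Element-wise sum out[i] = a[i] + b[i].
 * Hypothetical helper; restrict asserts that the arrays do not overlap. */
void vec_add(double *restrict out, const double *restrict a,
             const double *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The caller then writes a single line such as vec_add(c, a, b, n), which is about as close to the "one command" style as plain C gets without a library.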

Loop vectorization and how to avoid it

Loop vectorization is when all right-hand-side expressions are computed at the onset. I just discovered my loops are being vectorized (in FORTRAN 77... don't ask). I need my loop condition variable to be updated in each iteration, but how can I rewrite to work around this vectorization?
In a related post, I'm looking for a way to disable this optimization "feature" in FORTRAN specifically, but here I am looking for a more algorithmic solution to the general case.
That's not what loop vectorisation means to me. To me the phrase means that the compiler will generate code which can take advantage of any vector computation capabilities of the hardware. On a simple Intel Xeon this might mean generating SSE4 instructions to simultaneously manipulate a few adjacent array elements together, on a Cray there may be much more available in terms of simultaneous execution of the same operation on vector registers.
What makes you think that all the RHS expressions are 'computed at the onset'? I'm not sure what you mean by that. Could you post some code to explain? If you mean that the number of trips through the loop is computed on entry to the first iteration, then that is correct. That is a very useful feature when it comes to optimising code, and not one most Fortran programs would benefit from avoiding.
If you are writing DO loops in Fortran, updating the iteration variable is forbidden by the standard, and always has been as far as I recall. Your compiler might let you get away with it but I wouldn't trust a Fortran program in which this happened.
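The difference between a trip count fixed on entry and a condition re-evaluated every iteration can be sketched in C, used here only as neutral notation for the two behaviours. Both function names are hypothetical. The first mimics a Fortran DO loop with step 1; the second behaves like a DO WHILE loop, which is the usual rewrite when the bound really has to change during the loop.

```c
/* Mimics a Fortran DO loop: the number of trips is computed once, on entry,
 * so changing the bound inside the body does not alter it. */
int fortran_style_trips(int start, int limit)
{
    int trips = 0;
    int n = limit - start + 1;      /* trip count for step 1, fixed here */
    if (n < 0)
        n = 0;
    for (int k = 0; k < n; ++k) {
        limit -= 1;                 /* has no effect on n */
        ++trips;
    }
    return trips;
}

/* Condition re-evaluated before every iteration (DO WHILE behaviour):
 * changing the bound inside the body does change the number of trips. */
int while_style_trips(int start, int limit)
{
    int trips = 0;
    for (int i = start; i <= limit; ++i) {
        limit -= 1;                 /* seen by the next test of i <= limit */
        ++trips;
    }
    return trips;
}
```

Called with start 1 and limit 10, the first function makes 10 trips and the second makes 5.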

C coding practices for performance or code size - beyond what a compiler does

I'm looking to see what a programmer can do in C that determines the performance and/or the size of the generated object file.
For example:
1. Declaring simple get/set functions as inline may increase performance (at the cost of a larger footprint)
2. For loops that do not use the value of the loop variable itself, count down to zero instead of counting up to a certain value
etc.
It looks like compilers now have advanced to a level where "simple" tricks (like the two points above) are not required at all. Appropriate options during compilation do the job anyway. Heck, I also saw posts here on how compilers handle recursion - that was very interesting! So what are we left to do at a C level then? :)
My specific environment is: GCC 4.3.3 re-targeted for ARM architecture (v4). But responses on other compilers/processors are also welcome and will be munched upon.
PS: This approach of mine goes against the usual "code first!, then benchmark, and finally optimize" approach.
Edit: Just like it so happens, I found a similar post after posting the question: Should we still be optimizing "in the small"?
One thing I can think of that a compiler probably won't optimize is "cache-friendliness": If you're iterating over a two-dimensional array in row-major order, say, make sure your inner loop runs across the column index to avoid cache thrashing. Having the inner loop run over the wrong index can cause a huge performance hit.
This applies to all programming languages, but if you're programming in C, performance is probably critical to you, so it's especially relevant.
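A minimal sketch of the traversal-order point, assuming a matrix stored row-major in a flat array (the function names are hypothetical). Both functions compute the same sum; they differ only in which index the inner loop runs over, and therefore in the memory stride.

```c
#include <stddef.h>

/* Inner loop over the column index: consecutive iterations touch
 * adjacent memory (stride of one element). */
double sum_cache_friendly(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];
    return s;
}

/* Inner loop over the row index: consecutive iterations are
 * cols elements apart, which wastes most of each cache line. */
double sum_cache_hostile(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
    return s;
}
```

The difference only becomes measurable once the matrix is much larger than the cache, so timing both versions on realistic sizes is the way to see how much it matters on a given machine.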
"Always" know the time and space complexity of your algorithms. The compiler will never be able to do that job as well as you can. :)
Compilers these days still aren't very good at vectorizing your code so you'll still want to do the SIMD implementation of most algorithms yourself.
Choosing the right datastructures for your exact problem can dramatically increase performance (I've seen cases where moving from a Kd-tree to a BVH would do that, in that specific case).
Compilers might pad some structs/ variables to fit into the cache but other cache optimizations such as the locality of your data are still up to you.
Compilers still don't automatically make your code multithreaded, and using OpenMP, in my experience, doesn't really help much. (You really have to understand OpenMP anyway to dramatically increase performance.) So currently, you're on your own doing multithreading.
To add to what Martin says above about cache-friendliness:
Reordering your structures such that fields which are commonly accessed together are in the same cache line can help (for instance by loading just one cache line rather than two). You are essentially increasing the density of useful data in your data cache by doing this. There is a Linux tool which can help you in doing this: dwarves. http://www.linuxinsight.com/files/ols2007/melo-reprint.pdf
You can use a similar strategy for increasing the density of your code. In gcc you can mark hot and cold branches using likely/unlikely hints (usually macros around __builtin_expect). That enables gcc to keep the cold branches separate, which helps in increasing the icache density.
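A minimal sketch of the branch hints mentioned above, assuming GCC or a compatible compiler that provides __builtin_expect. The likely/unlikely macros follow a common convention, and sum_checked is a hypothetical example function.

```c
#include <stddef.h>

/* Conventional wrappers around GCC's __builtin_expect (GCC/Clang only). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Sums the entries of v, or returns -1 as soon as a negative entry is seen.
 * The error branch is marked as the cold path. */
long sum_checked(const int *v, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (unlikely(v[i] < 0))
            return -1;          /* rarely taken: hinted as cold */
        sum += v[i];
    }
    return sum;
}
```

The hints only affect code layout and branch weighting, not the result, so a wrong hint costs performance rather than correctness.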
And now for something completely different:
For fields that might be accessed (read and written) across CPUs, the opposite strategy makes sense. The trouble is that for coherence purposes only one CPU at a time can be allowed to write to the same address (in reality, the same cache line). This can lead to a condition called cache-line ping-pong. This is pretty bad and could be worse if that cache line contains other unrelated data. Here, padding this contended data to a cache-line length makes sense.
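The padding idea can be sketched with C11 alignment specifiers. The 64-byte line size is an assumption (common on x86-64, but not universal), and the struct names are hypothetical.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumption: 64-byte cache lines */

/* Two counters that very likely share one cache line: writes from
 * different CPUs to a and b make the line bounce between them. */
struct counters_shared {
    unsigned long a;
    unsigned long b;
};

/* One counter aligned to, and therefore padded out to, a full cache line. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};

/* Each counter now sits on its own cache line. */
struct counters_padded {
    struct padded_counter a;
    struct padded_counter b;
};

_Static_assert(sizeof(struct padded_counter) == CACHE_LINE,
               "padded counter should occupy exactly one cache line");
```

The cost is memory: each padded counter takes a full line instead of a few bytes, which is why this is reserved for data that is actually contended.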
Note: these clearly are micro-optimizations, to be done only at later stages when you are trying to wring the last bits of performance from your code.
Precompute where possible (sorry, but it's not always possible; I did extensive precomputation in my chess engine). Store those results in memory, keeping the cache in mind: the bigger the precomputed data, the lower the chance of a cache hit. Since most recent hardware is multicore, you can also design your application to target it.
If you are using several big arrays, make sure you group them close to each other according to where they are used, boosting cache hits.
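As one illustration of precomputation with the cache in mind, here is a sketch of a small lookup table: the number of set bits for every byte value, computed once and then used to count bits in a 64-bit word. This is not taken from the answer's chess engine; the names are hypothetical, and the 256-byte table is deliberately small enough to stay in cache.

```c
#include <stdint.h>

/* 256-entry table: bits_in_byte[v] is the number of set bits in v.
 * Entry 0 is already 0 thanks to static zero-initialization. */
static uint8_t bits_in_byte[256];

/* Fill the table once at start-up. */
void init_bits_in_byte(void)
{
    for (int i = 1; i < 256; ++i)
        bits_in_byte[i] = (uint8_t)((i & 1) + bits_in_byte[i >> 1]);
}

/* Count set bits in a 64-bit word using eight table lookups. */
int popcount64(uint64_t x)
{
    int n = 0;
    for (int k = 0; k < 8; ++k) {
        n += bits_in_byte[x & 0xFF];
        x >>= 8;
    }
    return n;
}
```

init_bits_in_byte() has to be called once before the first popcount64() call. On hardware with a population-count instruction a compiler builtin is likely faster, which is the kind of trade-off the answer hints at.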
Many people are not aware of this: define your own inline macro (the spelling varies by compiler) that really means "inline this". Many compilers treat the inline keyword as something quite different from its original meaning. There are also ways to increase the inline size limits before the compiler begins moving trivial things out of line. Human-directed inlining can produce much faster code (compilers are often conservative, or do not account for enough of the program), but you need to learn to use it correctly, because it can (easily) be counterproductive. And yes, this absolutely applies to code size as well as speed.
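A minimal sketch of such a compiler-dependent macro, using the forced-inline spellings of GCC/Clang and MSVC and falling back to plain inline elsewhere. The macro name FORCE_INLINE and the point accessors are hypothetical.

```c
/* Pick the strongest "really inline this" spelling the compiler offers. */
#if defined(__GNUC__)
#  define FORCE_INLINE static inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#  define FORCE_INLINE static __forceinline
#else
#  define FORCE_INLINE static inline   /* only a hint on other compilers */
#endif

struct point { int x; int y; };

/* Trivial get/set accessors of the kind mentioned in the question. */
FORCE_INLINE int point_get_x(const struct point *p)
{
    return p->x;
}

FORCE_INLINE void point_set_x(struct point *p, int x)
{
    p->x = x;
}
```

Forcing inlining on large functions can grow the code and hurt the instruction cache, so it is worth measuring both speed and object size after applying it.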