Disable vectorized looping in FORTRAN?

Is it possible to bypass loop vectorization in FORTRAN? I'm writing to F77 standards for a particular project, but the GNU gfortran compiles up through modern FORTRANs, such as F95. Does anyone know if certain FORTRAN standards avoided loop vectorization or if there are any flags/options in gfortran to turn this off?
UPDATE: So, I think the final solution to my specific problem has to "DO" with the FORTRAN DO loops not allowing the updating of the iteration variable. Mention of this can be found in @High Performance Mark's reply on this related thread: "Loop vectorization and how to avoid it"
[Into the FORT, RAN the noobs for shelter.]

The Fortran standards are generally silent on how the language is to be implemented, leaving that to the compiler writers who are in a better position to determine the best, or good (and bad) options for implementation of the language's various features on whatever chip architecture(s) they are writing for.
What do you mean when you write that you want to bypass loop vectorisation? And why do you suggest in the next sentence that it would be unavailable to FORTRAN77 programs? It is perfectly normal for a compiler for a modern CPU to generate vector instructions if the CPU is capable of obeying them. This is true whatever version of the language the program is written in.
If you really don't want to generate vector instructions then you'll have to examine the gfortran documentation carefully -- it's not a compiler I use so I can't point you to specific options or flags. You might want to look at its capabilities for architecture-specific code generation, paying particular attention to SSE level.
You might be able to coerce the compiler into not vectorising loops if all your loops are explicit (so no whole-array operations) and if you make your code hard to vectorise in other ways (dependencies between loop iterations, for example). But a good modern compiler, without interference, is going to try its damnedest to vectorise loops for your own good.
It seems rather perverse to me to try to force the compiler to go against its nature; perhaps you could explain in more detail why you want to do that.
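A sketch of the two kinds of loop described above, written in C rather than Fortran because the dependence question is the same in both languages. The function names are made up for illustration. With GCC, -O3 -fopt-info-vec reports which loops were vectorised and -fno-tree-vectorize switches the vectoriser off; whether a given loop actually gets vectorised depends on the compiler version and the target.

```c
#include <stddef.h>

/* Independent iterations: each y[i] depends only on x[i] and y[i],
   so a vectorising compiler may process several elements at once. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependence: y[i] needs the y[i-1] computed one
   iteration earlier, which makes straightforward vectorisation hard. */
void running_sum(float *y, const float *x, size_t n)
{
    if (n == 0)
        return;
    y[0] = x[0];
    for (size_t i = 1; i < n; ++i)
        y[i] = y[i - 1] + x[i];
}
```

Both functions produce the same values whether or not the compiler vectorises them; only the running time should differ.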

As High Performance Mark wrote, the compiler is free to select machine instructions to implement your source code as long as the results follow the rules of the language. You should not be able to observe any difference in the output values as a result of loop vectorization; your code should simply run faster. So why do you care?
Sometimes differences can be observed across optimization levels, e.g., on some architectures registers have extra precision.
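One concrete way such differences arise: floating-point addition is not associative, so anything that changes the order of operations or the precision of intermediate values can change the result. A minimal sketch in C; the function names are made up, and volatile is used only to force each intermediate to be rounded to float.

```c
/* Sum three floats left to right: (a + b) + c. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;   /* intermediate rounded to float */
    return t + c;
}

/* Sum the same three floats right to left: a + (b + c). */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;   /* intermediate rounded to float */
    return a + t;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f, sum_left gives 1.0 while sum_right gives 0.0, because -1e8f + 1.0f rounds back to -1e8f in single precision.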
The place to look for these sorts of compiler optimizations is the gcc manual. They are located there since they are common across the gcc compiler suite.

With most modern compilers, the command-line option -O0 should turn off all optimisations, including loop vectorisation.
I have sometimes found that this causes bugs to apparently disappear. Usually, however, this means that there is something wrong with my code, so if this sort of thing is happening to you then you have almost certainly written a buggy program.
It is theoretically possible, but much less likely, that there is a bug in the compiler. You can easily check this by compiling your code with another Fortran compiler (e.g. gfortran or g95).

gfortran doesn't auto-vectorize unless you have set -O3 or -ftree-vectorize (GCC 12 and later also enable the vectorizer at -O2, with a conservative cost model). So it's easy to avoid vectorization: stay at a lower optimization level or pass -fno-tree-vectorize. You will probably need to read (skim) the gcc manual as well as the gfortran one.
Auto-vectorization has been a well-known feature of Fortran compilers for over 35 years, and even the Fortran 77 definition of DO loops was set with this in mind (and also in view of some known non-portable abuses of the F66 standard). You could not count on turning off vectorization as a way of making incorrect code work, although vectorization might expose symptoms of incorrect code.

Related

Can you do all gcc optimizations (-O2, -O3) manually in your c source code?

My class project is set to use gcc's optimization level of -O0 (no optimizations), and we are not allowed to change it for the final submission.
I tested my code using -O2 and got around a 2x speedup of my entire program. So I was wondering, is it possible to go through each optimization that -O2 does and manually do those optimizations in my code? Or are some of the -O2 optimizations internal to the stack, frame, machine/assembly, etc., and therefore impossible for me, the programmer, to make manually in my source code (if that makes sense)?
Is it possible to go through each optimization that -O2 does, and manually do those optimizations in my code?
No. Many of the optimizations performed by the compiler cannot be represented in C. Some of these include:
Disabling the frame pointer
Removing unnecessary register saves/restores at the beginning and end of a function
"Peephole" optimizations on the assembly, such as removing redundant moves, loads, or stores
Inserting no-ops to align loops to specific address boundaries (typically 16 bytes)
This isn't to say that all of the optimizations performed by the compiler are untranslatable, of course -- merely that some of them are.
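Some source-level transformations can be done by hand. A small sketch, assuming a hypothetical character-counting routine: the second version hoists the loop-invariant strlen call out of the loop condition, which is the kind of thing an optimizer may do for you at higher levels and will not do at -O0.

```c
#include <stddef.h>
#include <string.h>

/* Straightforward version: strlen(s) is re-evaluated on every
   iteration unless the compiler hoists it. */
size_t count_char_naive(const char *s, char c)
{
    size_t count = 0;
    for (size_t i = 0; i < strlen(s); ++i)
        if (s[i] == c)
            ++count;
    return count;
}

/* Hand-optimized version: the invariant length is computed once. */
size_t count_char_hoisted(const char *s, char c)
{
    size_t count = 0;
    const size_t len = strlen(s);
    for (size_t i = 0; i < len; ++i)
        if (s[i] == c)
            ++count;
    return count;
}
```

Both versions return the same count; the frame-pointer, register-save and alignment items in the list above have no such source-level counterpart.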
Yes, but that's the same as building your own 8086-class microprocessor in Minecraft — not worth your time and effort. And yes, many of those optimizations involve stuff below the language level of abstraction. Your professor might have unknown-to-you reasons for wanting an unoptimized executable.

What's the purpose of using assembly language inside a C program?

What's the purpose of using assembly language inside a C program? Compilers are able to generate assembly language already. In what cases would it be better to write assembly than C? Is performance a consideration?
In addition to what everyone said: not all CPU features are exposed to C. Sometimes, especially in driver and operating system programming, one needs to explicitly work with special registers and/or commands that are not otherwise available.
Also vector extensions.
That was especially true before the advent of compiler intrinsics. Those alleviate the need for inline assembly somewhat.
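As a small illustration of an instruction with no direct C equivalent, the sketch below reads the x86 time-stamp counter with GCC-style extended inline assembly. It assumes GCC or Clang on an x86 target; the helper name read_cycle_counter is hypothetical, and on other targets it simply returns 0.

```c
#include <stdint.h>

/* Read the CPU time-stamp counter. RDTSC returns its 64-bit result
   split across the EDX:EAX register pair, which plain C cannot express. */
uint64_t read_cycle_counter(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;   /* fallback for targets this sketch does not cover */
#endif
}
```

Compilers for x86 also provide an intrinsic for this instruction, which is an instance of intrinsics reducing the need for inline assembly.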
One more use case for inline assembly has to do with interfacing C with reflected languages. Specifically, assembly is all but necessary if you need to call a function when its prototype is not known at compile time. In other words, when the quantity and datatypes of that function's arguments are but runtime variables. C variadic functions and the stdarg machinery won't help you in this case - they would help you parse a stack frame, but not build one. In assembly, on the other hand, it's quite doable.
This is not an OS/driver scenario. There are at least two technologies out there - Java's JNI and COM Automation - where this is a must. In case of Automation, I'm talking about the way the COM runtime is marshaling dual interfaces using their type libraries.
I can think of a very crude C alternative to assembly for that, but it'd be ugly as sin. Slightly less ugly in C++ with templates.
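One guess at what such a crude C alternative might look like (the answer does not spell it out, so this is an assumption): dispatch on the argument count with a switch, casting the function pointer to the matching prototype. The names generic_fn, call_with_longs and add3 are hypothetical, and the sketch only handles up to four long arguments, which is exactly why it does not scale.

```c
#include <stddef.h>

typedef void (*generic_fn)(void);   /* placeholder function-pointer type */

/* Call fn with n long arguments taken from args; n is known only at
   run time. Returns 0 if n exceeds the cases written out by hand. */
long call_with_longs(generic_fn fn, const long *args, size_t n)
{
    switch (n) {
    case 0: return ((long (*)(void))fn)();
    case 1: return ((long (*)(long))fn)(args[0]);
    case 2: return ((long (*)(long, long))fn)(args[0], args[1]);
    case 3: return ((long (*)(long, long, long))fn)(args[0], args[1], args[2]);
    case 4: return ((long (*)(long, long, long, long))fn)(args[0], args[1],
                                                          args[2], args[3]);
    default: return 0;   /* unsupported arity */
    }
}

/* Example target: its real prototype must match the case selected above. */
long add3(long a, long b, long c)
{
    return a + b + c;
}
```

Every extra argument count or argument type needs another hand-written case, whereas assembly can build the call frame in a loop.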
Yet another use case: crash/run-time error reporting. For postmortem debugging, you'd want to capture as much of the program state at the point of the crash as possible (i.e. all the CPU registers), and assembly is a much better vehicle for that than C. Postmortem debugging of crashing native code usually involves staring at the assembly anyway.
Yet another use case - code that is intended for execution in another process without that process' co-operation or knowledge. This is often referred to as "shellcode", but it doesn't have to be shell related. Code like that needs to be very carefully written, and it can't rely on the conveniences of a high level language (like the run time library, or having a data section) that are normally taken for granted. When one is after injecting a significant piece of functionality into a target process, they usually end up loading a dynamic library, but the initial trampoline code that loads the library and passes control to it tends to be in assembly.
I've been only covering cases where assembly is necessary. Hand-optimizing for performance is covered in other answers.
There are a few, although not many, cases where hand-optimized assembly language can be made to run more efficiently than assembly language generated by C compilers from C source code. Also, for developers used to assembly language, some things can just seem easier to write in assembler.
For these cases, many C compilers allow inline assembly.
However, this is becoming increasingly rare as C compilers get better and better at producing efficient code, and most platforms put restrictions on the low-level kind of software that often benefits most from being written in assembler.
In general, it is performance, but performance of a very specific kind. For example, the SIMD parallel instructions of a processor might not be generated by the compiler. By using processor-specific data formats and then issuing processor-specific parallel instructions (e.g. ARM NEON or Intel SSE), very fast performance on graphics or signal processing problems can be achieved. Even then, some compilers allow these to be expressed in C using intrinsic functions.
While it used to be common to use assembly language inserts to hand-optimize critical functions, those days are largely done. Modern compilers are very good and modern processors have very complicated timing requirements so hand optimized code is often less optimal than expected.
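A minimal sketch of the intrinsic route mentioned above, assuming an x86 target with SSE (the function name add_floats is made up). The intrinsics come from xmmintrin.h; where SSE is not available, the sketch falls back to plain C.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* out[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration, using unaligned loads and stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when SSE is unavailable. */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The code stays in C and the compiler still handles register allocation and scheduling, which is much of the appeal of intrinsics over inline assembly.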
There are various reasons to write inline assembly in C. They can be roughly divided into necessary and unnecessary ones.
The unnecessary reasons might include:
platform compatibility
performance concerns
code optimization
etc.
I consider these unnecessary because they can sometimes be dropped or implemented in pure C. Take platform compatibility: you can always implement a separate version for each platform, although inline assembly might reduce the effort. I won't say much more about the unnecessary reasons here.
The necessary reasons might include:
something the standard libraries cannot do
an instruction set the compiler does not support
incorrectly generated object code
writing stack-sensitive code
etc.
These are necessary because they are almost impossible to achieve in pure C. For example, in old versions of DOS, the software interrupt INT 21h was not reentrant. If you wanted to write a virtual drive using only the INT 21h support provided by the compiler, it was impossible. You would need to hook the original INT 21h and make it reentrant. However, the compiled code wraps every call in a prolog/epilog, so you could never get around the restriction without crashing the code. You can try any trick in pure C with libraries, but even if you find one that works, it only means you found a particular sequence of machine code that the compiler happens to generate; in effect, you were trying to get the compiler to emit exact machine code. So why not write inline assembly directly?
This example covers all of the necessary reasons above except the unsupported instruction set, which I think is easy to imagine.
In fact, there are more reasons to write inline assembly, but this should give you some idea.
Just as a curiosity, I'm adding here a concrete example of something not-so-low-level you can only do in assembly. I read this in an assembly book from my university time where it was used to show an inherent limitation of C/C++, and how to overcome it with assembly.
The problem is: how do I invoke a function when the exact number of parameters is only known at runtime? In C/C++ you can easily define a function that takes a variable number of arguments, like printf. But when it comes to calling that function, the compiler wants to know exactly how many parameters must be passed. You may pass more parameters than required; that won't do any harm. But what if the number grows unexpectedly to 100 or 1000 parameters, and they must be picked out of an array?
The solution of course is using assembly, where you can dynamically create a stack frame of the proper size, copy the parameters on the stack, invoke the function, and finally reset the stack.
In practice, this would hardly ever be a limitation (except if the library you're using is really, really badly designed). People who use assembly in C have much better reasons to do so, as others have pointed out in their answers. Still, I think it may be an interesting fact to know.
I would rather think of it as a way to write very specific code for a specific platform; optimization, though still a common reason, matters less nowadays. Knowledge and use of assembly in C is also common among hackers of every hat color.

Performance of compiled code by compiled compiler

If I want to achieve better performance from, let's say, MySQLdb, I can compile it myself and get better performance, because it is then compiled not for i386, i486 or whatever, but for my CPU specifically. Further, I can choose the compile options and so on...
Now, I was wondering if this is also true for less ordinary software, such as a compiler.
Here comes the 1st part:
Will compiling a compiler like GCC result in better performance?
and the 2nd part:
Will the code compiled by my own compiled compiler perform better?
(Yes, I know, I can compile my compiler and benchmark it... but maybe ... someone already knows the answer, and will share it with us =)
In answer to your first question, almost certainly yes. Binary versions of gcc will be the "lowest common denominator" and, if you compile them with special flags more appropriate to your system, it will most likely be faster.
As to your second question, no.
The output of the compiler will be the same regardless of how you've optimised it (unless it's buggy, of course).
In other words, even if you totally stuffed up your compiler flags when compiling gcc, to the point where your particular compiled version of gcc takes a week and a half to compile "Hello World", the actual "Hello World" executable should be identical to the one produced by the "lowest common denominator" gcc (if you use the same flags).
(1) It is possible. If you introduce a new optimization to your compiler, and re-compile it with this optimization included - it is possible that the re-compiled code will perform better.
(2) No!!!! A compiler cannot change the logic of the code! In your case, the logic of the code is the native code produced at the end. So whether compiler A_1 is compiled using compiler A_2 or B has no effect on the native code produced by A_1 [here A_1 and A_2 are the same compiler; the index is just for clarity].
a. Well, you can compile the compiler for your system, and maybe it will run faster, like any program. (I think that usually it's not worth it, but do whatever you want.)
b. No. Even if you compile the compiler on your computer, its behavior should not change, and so the code that it generates doesn't change either.
Will compiling a compiler like GCC result in better performance?
A program compiled specifically for the target platform it is used on will usually perform better than a program compiled for a generic platform. Why is this? Knowledge about the hardware can help the compiler align data to be cache friendly and choose an instruction ordering that plays well with a CPU's pipelining.
The most benefit is usually achieved by leveraging specific instruction sets such as SSE (in its various versions).
On the other hand, you should ask yourself whether a program like GCC is really CPU bound (much more likely it will be IO bound) and whether tuning its CPU performance provides any measurable benefit.
Will the code compiled by my own compiled compiler perform better?
Hopefully not! Allowing a compiler to optimize a program should never change its behavior. No matter how you compiled your GCC, it should compile code to the same binaries as a generic binary distribution of GCC would.
If code compiled for the specific platform is faster than code compiled for a generic platform, why don't we all ship source code instead of binaries? Guess what, some Linux distros actually follow this philosophy, such as Gentoo. And while you're at it, make sure to build statically linked binaries; disk space is so cheap nowadays and it gives you at least another 0.001% of performance.
Alright, that was a bit sarcastic. The reason people distribute generic binaries is pretty obvious: it's generic, the lowest common denominator, and it will work everywhere. That's a big bonus in terms of flexibility and user friendliness. I remember once compiling Gnome for my Gentoo box; it took a day or two! (But it must have been so much faster ;-) )
On the other hand, there are occasions where you want to get the best performance possible and it makes sense to build and optimize for specific architectures.
GCC uses a three-stage bootstrap when building from source. Basically, it compiles the source three times to ensure that the build tools and the compiler are built correctly. This bootstrapping is used for validation purposes. However, it is possible to use stage 1 as a benchmark for optimizing the later stages. You should build GCC with make profiledbootstrap to use this profile-based optimization.
This profile based build process increases the performance of "GCC", but not the software compiled with it, as other answers point out.

Why would gcc -O0 be faster than icc -O0?

For a brief report I have to do, our class ran code on a cluster using both gcc -O0 and icc -O0. We found that gcc was about 2.5 times faster than icc without any optimizations. Why is this? Does gcc -O0 actually do some minor optimization, or does it simply happen to work better for this system?
The code was an implementation of the naive string searching algorithm found here, written in C.
Thank you
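For reference, the naive string-search algorithm mentioned in the question can be sketched as follows. This is the textbook form, not the class's actual code, and the function name naive_search is made up.

```c
#include <stddef.h>

/* Return the index of the first occurrence of pattern in text, or -1.
   Tries every starting position and compares character by character. */
long naive_search(const char *text, const char *pattern)
{
    if (pattern[0] == '\0')
        return 0;   /* an empty pattern matches at the start */
    for (size_t i = 0; text[i] != '\0'; ++i) {
        size_t j = 0;
        while (pattern[j] != '\0' && text[i + j] == pattern[j])
            ++j;
        if (pattern[j] == '\0')
            return (long)i;   /* whole pattern matched at position i */
    }
    return -1;
}
```

At -O0 nearly every variable access in these loops goes through memory, so small differences in how each compiler lays out such code can dominate the timing.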
Performance at -O0 is not interesting or indicative of anything. It explicitly says "I don't care about performance", and the compiler takes you up on that; it just does whatever happens to be simplest. By random luck, what is simplest for GCC is faster than what is simplest for ICC for one highly specific microbenchmark on your specific hardware configuration. If you ran 100 other microbenchmarks, you would probably find some where ICC is faster, too. Even if you didn't, that still wouldn't mean much. If you're going to compare performance across compilers, turn on optimizations, because that's what you do if you care about performance.
If you want to understand why one is faster, profile the execution. Where is the execution time being spent? Where are there stalls? Why do those stalls occur?
A few things to take into account:
The instruction set each compiler uses by default. For example if your GCC build produces i686 code by default, while ICC restricts itself to i586 opcodes, you would probably see a significant performance difference.
The actual CPUs in your cluster. If you are using AMD processors, instead of Intel CPUs, then ICC is at a disadvantage because it is, of course, targeted specifically to Intel processors.
You mentioned using a cluster. Does this speed difference exist on a single processor as well? If you used any parallelisation facilities provided by your compiler, there could be significant differences there.
Simplistically, when optimisations are disabled, the compiler uses pre-made "templates" for each code construct. Since these templates are intended to be optimised afterwards, they are constructed in a way that enables the optimisation passes to produce better code. The fact that they may be slower or faster with -O0 does not really mean anything - for example, more explicit initial code could be easier to optimise but far slower to execute.
That said, the only way to find out what is going on is to profile the execution of your code and, if necessary, have a look at the assembly of those parts of the code where the major differences lie.

How to compare compilers

What criteria do you use to compare compilers?
I'm told gcc is the best C compiler, is this true? If so, why?
I mean this generally, so you can state which compiler is more appropriate for which architecture.
(I hear icc would be more appropriate for Intel for instance, but I don't know why)
Personally I intend to use AMD 64 bit, develop both in Linux and Windows, GUI and non GUI apps.
Um, dunno where you heard that gcc is the "best C compiler". It's simply the most ubiquitous, and it was also a lot better than the native C compilers provided by most commercial UNIX vendors when gcc came about in the late 1980s and 1990s.
But what defines the "best"?
Time to compile code;
Size of compiled code;
Speed of compiled code;
Memory usage of compiled code;
Bugs and probability of seg faulting;
Support;
Community;
etc.
Different things matter to different people.
Here's one set of metrics comparing gcc to Intel's compiler and another comparison with clang. I'm sure you can find some comparisons to Microsoft's compiler too.
Generally speaking, people aren't all that concerned with the relative size or speed of a compiler (or even necessarily with the size or speed of the output, if the difference is less than a factor of two) but with whether it works or not (this was a real issue a decade or two ago), whether it supports the relevant standards and whether it has any oddities/bugs/features you have to work around.
In general: first of all, the most important aspect of compiler quality is correctness. A compiler with bugs or unexpected behaviour can really wreck your day.
The quality of the resulting code, like speed, size and memory usage, is also at the top of the list.
The speed of compilation is another aspect, especially when compiling large projects.
One thing I find particularly important is error handling, the quality of messages you get when the compiler encounters stuff it can't (or won't) handle.
Correctness is the sine qua non.
I also like
To have a compiler that runs really fast (like lcc or ocamlc)
To have a compiler that produces really good code (like ocamlopt or MLton)
It's OK if they are two different compilers.
I hate having a compiler that makes programs break when a new version comes out. (Richard Stallman, phone your office.)
I know that the Intel and MS compilers have started doing code generation for SSE3/4 instructions and doing clever things like unrolling loops and supporting vectorisation in the compiler. Not sure GCC does this yet.
I always thought that error messages and warnings make a difference. Some compilers will make it unnecessarily difficult for you to understand what they are trying to say. Others are way more user-friendly. It's also nice when you can enable warnings without the compiler warning you endlessly about stuff it created itself.
Do you mean
best by speed of compiling
best by smallest code
best by fastest code?
You can create a test app, probably with some nasty code (that needs an intelligent optimizer) and use all compilers to test it.
Compare your benchmarks and use the one you like the most.
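A minimal sketch of such a test app in C. The workload and the function names are placeholders; a real comparison would use code representative of your own programs, built with each compiler at the same optimization level.

```c
#include <time.h>

/* Hypothetical workload: sums the integers below n. The volatile
   accumulator keeps the loop from being optimized away entirely. */
unsigned long workload(unsigned long n)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < n; ++i)
        sum += i;
    return sum;
}

/* Run fn(n) `repeats` times and return the CPU time used, in seconds. */
double time_workload(unsigned long (*fn)(unsigned long),
                     unsigned long n, int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        (void)fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling time_workload(workload, 100000000UL, 10) from main and printing the result gives one number per compiler; run it several times, since a single measurement is noisy.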
gcc is a horrible compiler. It has the BEST tech support, perhaps because of its price, the number of users and the internet (and Google for finding that help). But the quality of the machine code it generates is average to below average at best.
