does -march flag in compilers (for example: gcc) really matters?
would it be faster if i compile all my programs and kernel using -march=my_architecture instead of -march=i686
Yes it does, though the differences are only sometimes relevant. They can be quite big however if your code can be vectorized to use SSE or other extended instruction sets which are available on one architecture but not on the other. And of course the difference between 32 and 64 bit can (but need not always) be noticeable (that's -m64 if you consider it a type of -march parameter).
As anegdotic evidence, a few years back I run into a funny bug in gcc where a particular piece of code which was run on a Pentium 4 would be about 2 times slower when compiled with -march=pentium4 than when compiled with -march=pentium2.
So: often there is no difference, and sometimes there is, sometimes it's the other way around than you expect. As always: measure before you decide to use any optimizations that go beyond the "safe" range (e.g. using your actual exact CPU model instead of a more generic one).
There is no guarantee that any code you compile with march will be faster/slower w.r.t. the other version. It really depends on the 'kind' of code and the actual result may be obtained only by measurement. e.g., if your code has lot of potential for vectorization then results might be different with and without 'march'. On the other hand, sometimes compiler do a poor job during vectorization and that might result in slower code when compiled for a specific architecture.
Related
I am writing a c code and I would like to know if making simple operation like multiplication more CPU friendly makes any difference and the code faster. For example, replacing this line of code:
y = x * 15;
with
y = x << 4;
y -= x;
Does the compiler already do that? Should I use the -O2 option in order to make it happen?
There are two parts to the answer:
No, unless you are writing a very specialized function (e.g. a signal processing function that must execute in 20 clocks) you should not optimize; leave that to the compiler. In general your job is to write readable code, and the compiler will (to it's capability optimize it). Note that the optimization will be different for different processors as their hardware (computational capability) can be very different. For example a shift by N instruction (like that in your code) may take N clocks on a processor with a regular shifter, but it will take a single clock (or less) on processors with a hardware barrel shifter.
Yes, most modern optimizing compilers will optimize (for example replace multiplication by shifting where appropriate) without explicit optimization options.
Summarized, optimize only in rare situations when you already know that the compiler didn't do a good job, it is a problem that must be addressed, you know well how to do better than the compiler, and the resulting increased maintenance cost is worth it.
Optimizing code by hand is almost always an exercise in futility these days, especially in higher level languages. While C is almost assembly, modern compilers have a lot more tricks built into them than most people are aware of.
In addition, unless the code you're optimizing is going to be used a lot, i.e., millions of times in close succession, the work of optimizing the code will cost more time than the savings you achieve.
With that said, the only way to see if your code is measurably faster would be to test it: Put each version in a tight loop and execute it a million (or more) times, and see if there's a noticeable difference.
Note that your optimization is for a specific multiplier - any other operand you use it for is going to yield different results. Because it can't be generalized, there's little likelyhood this optimization will be done by any compiler in all cases - and just looking at the code and not knowing what processor architecture it will be run on, I can't say whether it would be faster or not.
I've got a very hot instruction loop which needs to be properly aligned on 32-bytes boundaries to maximize Intel's Instruction Fetcher effectiveness.
This issue is specific to Intel not-too-old line of CPU (from Sandy Bridge onward). Failure to align properly the beginning of the loop results in up to 20 % speed loss, which is definitely too noticeable.
This issue is pretty rare, one needs a highly optimized set of instructions for the instruction fetcher to become the bottleneck. But fortunately, it's not a unique case. Here is a nice article explaining in details how such a problem can be detected.
The problem is, gcc nor clang would care aligning properly this instruction loop. It makes compiling this code a nightmare producing random outcome, depending on how "good" the hot loop is aligned by chance. It also means that modifying a totally unrelated function can nonetheless highly impact performance of the hot loop.
Already tried several compiler flags, none of them gives satisfying result.
[Edit] More detailed description of tried compilation flags :
-falign-functions=32 : no impact or negative impact
-falign-jumps=32 : no impact
-falign-loops=32 : works fine when the hot loop is isolated into a tiny piece of test code. But in normal build, the compilation flag is applied across the entire source, and in this case it is detrimental : aligning all loops on 32-bytes is bad for performance. Only the very hot ones benefit from it.
Also attempted to use __attribute__((optimize("align-loops=32"))) in the function declaration. Doesn't produce any effect (identical binary generated, as if the the statement wasn't there). Later confirmed by gcc support team to be effectively ignored. Edit : #Jester indicates in comment that the statement works with gcc 5+. Unfortunately, my dev station uses primarily gcc 4.8.4, and this is more a problem of portability, since I don't control the final compiler used in the build process.
Only building using PGO can reliably produce expected performance, but PGO cannot be accepted as a solution since this piece of code will be integrated into other programs using their own build chain.
So, I'm considering inline assembly.
This would be specific to x64 instruction set, so no portability required.
If my understanding is correct, assembly like NASM allows the use of statements such as : ALIGN 32 which would force the next instruction to be aligned on 32 bytes boundaries.
Since the target source code is in C, it would be necessary to include this statement. For example, something like asm("ALIGN 32");
(which of course doesn't work).
I hope it's mostly a matter of knowing the right instruction to write, and not something deeper such as "it's impossible".
Similarly to NASM, the GNU assembler supports the .align pseudo OP for alignment:
volatile asm (".align 32");
For a non-assembly solution, you could try to supply -falign-loops=32 and possibly -falign-functions=32, -falign-jumps=32 as needed.
I am using vbcc compiler to translate my C code into Motorola 68000 ASM.
For whatever reason, every time I use the division (just integer, not floats) in code, the compiler only inserts the following stub into the ASM output (that I get generated upon every recompile):
public __ldivs
jsr __ldivs
I explicitly searched for all variations of DIVS/DIVU, but every single time, there is just that stub above. The code itself works (I debugged it on target device), so the final code does have the DIV instruction, just not the intermediate output.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
However, I can't do it if I don't see the resulting ASM code. Any ideas how to enable it ? The compiler manual does not specify anything like that, so there must clearly must be some other - probably common - higher principle in play ?
From the vbcc compiler system manual by Volker Barthelmann:
4.1 Additional options
This backend provides the following additional options:
-cpu=n Generate code for cpu n (e.g. -cpu=68020), default: 68000.
...
4.5 CPUs
The values of -cpu=n have those effects:
...
n>=68020
32bit multiplication/division/modulo is done with the mul?.l, div?.l and
div?l.l instructions.
The original 68000 CPU didn't have support for 32-bit divides, only 16-bit division, so by default vbcc doesn't generate 32-bit divide instructions.
Basically your question doesn't even belong here. You're asking about the workings of your compiler not the 68K cpu family.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
Then you are already fighting windmills. Chosing an obscure C compiler while at the same time desiring top performance are conflicting goals.
If you really need MC68000 code compatibility, the choice of C is questionable. Since the 68000 has zero cache, store/load orgies that simple C compilers tend to produce en masse, have a huge performance impact. It lessens considerably for the higher members and may become invisible on the superscalar pipelined ones (erm, one; the 68060).
Switch to 68020 code model if target platform permits, and switch compiler if you're not satisfied with your current one.
I wanted to take my first steps with Intel's SSE so I followed the guide published here, with the difference that instead of developing for Windows and C++ I make it for Linux and C (therefore I don't use any _aligned_malloc but posix_memalign).
I also implemented one computing intensive method without making use of the SSE extensions. Surprisingly, when I run the program both pieces of code (that one with SSE and that one without) take similar amounts of time to run, usually being the time of the one using the SSE slightly higher than the other.
Is that normal? Could it be possible that GCC does already optimize with SSE (also using -O0 option)? I also tried the -mfpmath=387 option, but no way, still the same.
For floating point operations you may not see a huge benefit with SSE. Most modern x86 CPUs have two FPUs so double precision may only be about the same speed for SIMD vs scalar, and single precision might give you 2x for SIMD over scalar on a good day. For integer operations though, e.g. image or audio processing at 8 or 16 bits, you can still get substantial benefits with SSE.
GCC has a very good inbuilt code vectorizer, (which iirc kicks in at -O0 and above), so this means it will use SIMD in any place that it can in order to speed up scalar code (it will also optimize SIMD code a bit too, if its possible).
its pretty easy to confirm this is indeed whats happening here, just disassemble the output (or have gcc emit commented asm files).
Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?
I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given the code of algorithms that can (and were, after I manually rewrote them using SSE intrinsics) be vectorized.
Part of this is being conservative - especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer to not optimize code rather than risking miscompiling it. This is one area where higher level languages have a real advantage over C, at least in theory (I say in theory since I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
Another part of it is simply analytical limitations - most research in vectorization, I understand, is related to optimizing classical numerical problems (fluid dynamics, say) which was the bread and butter of most vector machines before a few years ago (when, between CUDA/OpenCL, Altivec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).
It's fairly unlikely that code written for a scalar processor in mind will be easy for a compiler to vectorize. Happily, many things you can do to make it easier for a compiler to understand how to vectorize it, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize it.
It is hard to use in any business logic, but gives speed ups when you are processing volumes of data in the same way.
Good example is sound/video processing where you apply the same operation to every sample/pixel.
I have used VisualDSP for this, and you had to check the results after compiling - if it is really used where it should.
Vectorized instructions are not limited to Cell processors - most modern workstations-like CPU have them (PPC, x86 since pentium 3, Sparc, etc...). When used well for floating points operations, it can help quite a lot for very computing intensive tasks (filters, etc...). In my experience, automatic vectorization does not work so well.
You may have noticed that pretty much no-one actually knows how to make good use of GCC's Automatic Vectorization. If you search around the web to see people's comments, it always come to the idea that GCC allows you to enable automatic vectorization, but it extremely rarely makes actual use of it, and so if you want to use SIMD acceleration (eg: MMX, SSE, AVX, NEON, AltiVec), then you basically haveto figure out how to write it using compiler intrinsics or Assembly language code.
But the problem with intrinsics is that you effectively need to understand the Assembly language side of it and then also learn the Intrinsics method of describing what you want, which is likely to result in much less efficient code than if you wrote it in Assembly code (such as by a factor of 10x), because the compiler is still going to have trouble making good use of your intrinsic instructions!
For example, you might be using SIMD Intrinsics so that many operations can be performed in parallel at the same time, but your compiler will probably generate Assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed (or even slower) than normal code!
So basically:
If you want upto 100% speedups (2x
speed), then either buy the
official Intel/ARM compilers or convert some of your code to use SIMD C/C++ Intrinsics.
If you
want 1000% speedups (10x speed), then
write it in Assembly code using SIMD instructions by hand. Or if available on your hardware, use GPU acceleration instead such as OpenCL or Nvidia's CUDA SDK, since they can provide similar speedups in the GPU as SIMD does in the CPU.