I am working on Nehalam/westmere Intel micro architecture CPU. I want to optimize my code for this Architecture. Are there any specialized compilation flags or C functions by GCC which will help me improve my code's run time performance?
I am already using -O3.
Language of the Code - C
Platform - Linux
GCC Version - 4.4.6 20110731 (Red Hat 4.4.6-3) (GCC)
In my code I have some floating point comparison and they are done over a million time.
Please assume the code is already best optimized.
First, if you really want to profit from optimization on newer processors like this one, you should install the newest version of the compiler. 4.4 came out some years ago, and even if it still seems maintainted, I doubt that the newer optimization code is backported to that. (Current version is 4.7)
Gcc has a catch-all optimization flag that usually should produce code that is optimized for the compilation architecture: -march=native. Together with -O3 this should be all that you need.
Warning: the answer is incorrect.
You can actually analyze all disabled and enabled optimizations yourself. Run on your computer:
gcc -O3 -Q --help=optimizers | grep disabled
And then read about the flags that are still disabled and can according to the gcc documentation influence performance.
You'll want to add an -march=... option. The ... should be replaced with whatever is closest to your CPU architecture (there tend to be minor differences) described in the i386/x86_64 options for GCC here.
I would use core2 because corei7 (the one you'd want) is only available in GCC 4.6 and later. See the arch list for GCC 4.6 here.
If you really want to use a gcc so old that it doesn't support corei7, you could use -mtune=barcelona
Related
I searched a little bit and one google search was enough to discover the differences between the gcc and cc compilers, but I did not find the advantages in using one or another to compile C programs
Which compiler should I use? and why?
The compiler installed as part of X-Code on OS/X is a recent version of clang whose development is sponsored by Apple.
gcc is not provided nor supported by Apple.
Unless you install gcc explicitly from one of its distributions, gcc is an alias for clang on OS/X, just like cc.
The reason for this is to support packages that use gcc explicitly as the C compiler.
On your system, it does not matter which alias you use, the compiler invoked will be clang, which has a high degree of compatibility with gcc extensions but generates different code. Both are very advanced and dependable.
I have written a program with AVX intrinsics, which works well using Ubuntu 12.4 LTS and GCC 4.6 with the following compilation line: g++ -g -Wall -mavx ProgramName.cc -o ProgramName
The problem started When i have updated the compiler up to 4.7 and 4.8.1 versions to support the 16-bit AVX2 intrinsics, which is not supported in gcc 4.6
Currently, the updated gcc version compiles both AVX and AVX2 programs properly. However, it gives me the following error when i run the program: Illegal instruction (core dumped), although it was working on gcc 4.6
My question is: what is prefect way to compile and run both AVX and AVX2 intrinsics
If you tell gcc to use AVX2, it will do so, regardless of whether your CPU supports them or not. That can be useful for cross-compiling or for examining gcc's code generation, but it's not particularly helpful for running programs. If your program crashes with an illegal instruction exception, it is most likely that your CPU does not support the AVX2 extension.
On i386 and x86-64 platforms (and in certain other circumstances), you can specify the gcc option -march=native to generate code for the host machines instruction code. The compiled code might not work on another machine with fewer capabilities, but it should allow you to use all the features of your machine.
While -march=native is a good solution for generating executables, it does not actually help much with writing code; you still need to tailor the instrinsics for the target's architecture, and writing code which can take advantage of CPU features without relying on them gets complicated. I don't know of a good C solution, but there are several C++ template frameworks available.
Upgrading to gcc 4.8 likely pulled in AVX512, so you would have needed to limit the generated instr mix to ONLY AVX2 for your machine.
I am getting the following error while trying to compile some code for an ARM Cortex-M4
using
gcc -mcpu=cortex-m4 arm.c
`-mcpu=' is deprecated. Use `-mtune=' or '-march=' instead.
arm.c:1: error: bad value (cortex-m4) for -mtune= switch
I was following GCC 4.7.1 ARM options. Not sure whether I am missing some critical option. Any kickstart for using GCC for ARM will also be really helpful.
As starblue implied in a comment, that error is because you're using a native compiler built for compiling for x86 CPUs, rather than a cross-compiler for compiling to ARM.
GCC only supports a single general architecture type in any given compiler binary -- so, although the same copy of GCC can compile for both 32-bit and 64-bit x86 machines, you can't compile to both x86 and ARM with the same copy of GCC -- you need an ARM-specific GCC.
(As auselen suggests, getting a pre-built one will save you quite a lot of work, even if you're only using it as a starting point to get things set up. You need to have GCC, binutils, and a C library as a minimum, and those are all separate open-source projects that the pre-built versions have already done the work of combining. I'll recommend Sourcery CodeBench Lite since that's the one my company makes and I do think it's a fairly good one.)
As the error message says -mcpu is deprecated, and you should use the other options stated. However "deprectated" simply means that its use may not continue to be supported; it will still work.
ARM Cortex-M4 is ARM Architecture V7E-M, so you should use -march=armv7-m (the documentation does not specifically list armv7e-m, but that may have been added since the documentation was last updated. The E is essentially the difference between M3 and M4 - the DSP instructions, so the compiler will not generate code that takes advantage of these instructions. Using ARM's Cortex-M DSP library is probably the best way to use these instructions to benefit your application. If your part has an FPU, then other options will be needed enable code generation for that.
Like others already pointed out, you are using a compiler for your host machine, and you need a compiler for generating code for your target processor instead (a cross compiler). Like #Brooks suggested, you can use a pre-built toolchain, but if you want to roll out your own cross-compiler, libc and binutils, there is a nice tool called Crosstool-NG. It greatly simplifies the process of building a cross-compiler optimized to generate code for a specific processor, so you're not stuck with a generic prebuilt toolchain, which usually builds code for a family of compatible processors (e.g. you could tune the toolchain for generating ASM for your specific target, or floating point code for a hardware FPU which is specific to your processor, instead of using only software floating point routines, which are default to most pre-built toolchains).
I'm using the standard gcc compiler in math software development with C-language. I don't know that much about compilers or compiler options, and I was just wondering, is it possible to make faster executables using another compiler or choosing better options? The default Makefile sets options -ffast-math and -O3 and I think both of them have some impact in the overall calculation time. My software is using memory quite extensively, so I imagine some options related to memory management might do the trick?
Any ideas?
Before experimenting with different compilers or random, arbitrary micro-optimisations, you really need to get a decent profiler and profile your code to find out exactly what the performance bottlenecks are. The actual picture may be very different from what you imagine it to be. Once you have a profile you can then start to consider what might be useful optimisations. E.g. changing compiler won't help you if you are limited by memory bandwidth.
Here are some tips about gcc performance:
do benchmarks with -Os, -O2 and -O3. Sometimes -O2 will be faster because it makes shorter code. Since you said that you use a lot of memory, try with -Os too and take measurements.
Also check out the -march=native option (it is considered safe to use, if you are making executable for computers with similar processors) on the client computer. Sometimes it can have considerable impact on performance. If you need to make a list of options gcc uses with native, here's how to do it:
Make a small C program called test.c, then
$ touch test.c
$ gcc -march=native -fverbose-asm -S test.c
$ cat test.s
credits for code goto Gentoo forums users.
It should print out a list of all optimizations gcc used. Please note that if you're using i7, gcc 4.5 will detect it as Atom, so you'll need to set -march and -mtune manually.
Also read this document, it will help you (still, in my experience on Gentoo, -march=native works better) http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html
You could try with new options in late 4.4 and early 4.5 versions such as -flto and -fwhole-program. These should help with performance, but when experimenting with them, my system was unstable. In any case, read this document too, it will help you understand some of GCC's optimization options http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
If you are running Linux on x86 then typically the Intel or PGI compilers will give you significantly faster performing executables.
The downsides are that there are more knobs to tune and that they come with a hefty price tag!
If you have specific hardware you can target your code for, the (hardware) company often releases paid-for compilers optimized for that hardware.
For example:
xlc for AIX
CC for Solaris
These compilers will generally produce better code optimization-wise.
As you say your program is memory heavy you could test to use a different malloc implementation than the one in standard library on your platform.
For example you could try the jemalloc (http://www.canonware.com/jemalloc/).
Keep in mind they most improvements to be had by changing compilers or settings will only get you proportional speedups where as adjusting algorithms you can sometimes get improvements in the O() of your program. Be sure to exhaust that before you put to much work into tweaking settings.
The product-group I work for is currently using gcc 3.4.6 (we know it is ancient) for a large low-level c-code base, and want to upgrade to a later version. We have seen performance benefits testing different versions of gcc 4.x on all hardware platforms we tested it on. We are however very scared of c-compiler bugs (for a good reason historically), and wonder if anyone has insight to which version we should upgrade to.
Are people using 4.3.2 for large code-bases and feel that it works fine?
The best quality control for gcc is the linux kernel. GCC is the compiler of choice for basically all major open source C/C++ programs. A released GCC, especially one like 4.3.X, which is in major linux distros, should be pretty good.
GCC 4.3 also has better support for optimizations on newer cpus.
When I migrated a project from GCC 3 to GCC 4 I ran several tests to ensure that behavior was the same before and after. Can you just run a run a set of (hopefully automated) tests to confirm the correct behavior? After all, you want the "correct" behavior, not necessarily the GCC 3 behavior.
I don't have a specific version for you, but why not have a 4.X and 3.4.6 installed? Then you could try and keep the code compiling on both versions, and if you run across a show-stopping bug in 4, you have an exit policy.
Use the latest one, but hunt down and understand each and every warning -Wall gives. For extra fun, there are more warning flags to frob. You do have an extensive suite of regression (and other) tests, run them all and check them.
GCC (particularly C++, but also C) has changed quite a bit. It does much better code analysis and optimization, and does handle code that turns out to invoke undefined bahaviiour differently. So code that "worked fine" but really did rely on some particular interpretation of invalid constructions will probably break. Hopefully making the compiler emit a warning or error, but no guarantee of such luck.
If you are interested in OpenMP then you will need to move to gcc 4.2 or greater. We are using 4.2.2 on a code base of around 5M lines and are not having any problems with it.
I can't say anything about 4.3.2, but my laptop is a Gentoo Linux system built with GCC 4.3.{0,1} (depending on when each package was built), and I haven't seen any problems. This is mostly just standard desktop use, though. If you have any weird code, your mileage may vary.