Which compilation flag should I use: -Os or -O2?

I'm currently working on an embedded device application (in C). What optimization flag should I use for compiling this application, keeping in mind that the device only has 96 MB of RAM?
Also, please note that in this application I'm mainly pre-processing JPEG images. So which optimization flag should I use?
Also, will stripping this app have any effect on efficiency and speed?
The OS on which I'm running this app is Linux 2.6.37.

Generally, optimization increases the binary size. Beyond that, the effect on speed is not predictable at all; it depends on your data set. The only way to know is to benchmark the application with different sets of flags, not just -O2 or -O3 but any other flags you think may improve performance, since you have the best information on what the program does and how it behaves for different inputs.
Performance depends on the nature of the application, so I don't think anyone can give you a convincing answer as to which flags will give you better performance.
Look at the GCC optimization flags and analyze your algorithm so as to find suitable flags and then decide which ones to use.

-Os is preferable. Not only is RAM limited, but the CPU's cache size is limited as well, so code compiled with -Os can execute faster despite using fewer optimization techniques, simply because more of it fits in cache.

Related

Will it be feasible to use gcc's function multi-versioning without code changes?

According to most benchmarks, Intel's Clear Linux is way faster than other distributions, mostly thanks to a GCC feature called Function Multi-Versioning (FMV). Right now the method they use is to compile the code, analyze which functions contain vectorized loops, then patch the code with FMV attributes and compile it again.
How feasible will it be for GCC to do it automatically? For example, by passing -mmultiarch=sandybridge,skylake (or a similar -m option listing CPU extensions like AVX and AVX2).
Right now I'm interested in two usage scenarios:
1. Use this option for our large math-heavy program when delivering releases to our customers. I don't want to pollute the code with non-standard attributes, and I don't want to modify the third-party libraries we use.
2. Other Linux distributions would be able to do this easily, without patching the code as Intel does. This should give all Linux users massive performance gains.
No, but it doesn't matter. There's very, very little code that will actually benefit from this; for the most part, doing it globally will just make your system much more memory-constrained and slower due to the huge increase in code size (unless you make a special effort to group matching versions together in pages). Most actual loads aren't even CPU-bound; they're syscall-overhead-bound, GPU-bound, IO-bound, etc. And many of the modern ones that are CPU-bound aren't running precompiled code but JIT'd code (i.e. everything running in a browser, whether that's your real browser or the outdated and unpatched fork of Chrome in every Electron app).

OpenMP and how to check cache behavior

I am using OpenMP to parallelize loops in my code in order to optimize it.
I hear that OpenMP code can also exhibit good or bad cache behavior.
How do I see these cache interactions so I can arrange good cache behavior for my OpenMP pragma loops?
OpenMP itself cannot be used to get information about the cache usage of your program. Depending on your platform, there are tools that will give insight into cache behavior.
On Linux systems you can use perf.
perf stat -e cache-references,cache-misses <your-exe>
outputs statistics about the cache-misses. There are a lot more events that can be used (see here for further details). Common events are collected if you simply run:
perf stat <your-exe>
Another tool that can also be used on Windows is the Intel® Performance Counter Monitor. Although it only works with Intel CPUs, it can collect additional information like occupied memory bandwidth (on supported models).
However, while these tools can help you measure the cache usage of your program, they do not improve it. You have to optimize your code manually and then re-check whether the cache misses have been reduced.
If you're looking at a specific kernel, you might want to consider PAPI.

Useful GCC flags to improve security of your programs?

By pure chance I stumbled over an article mentioning you can "enable" ASLR with -pie -fPIE (or, rather, make your application ASLR-aware). -fstack-protector is also commonly recommended (though I rarely see explanations how and against which kinds of attacks it protects).
Is there a list of useful options and explanations how they increase the security?
...
And how useful are such measures anyway, when your application uses about 30 libraries that use none of those? ;)
Hardened Gentoo uses these flags:
CFLAGS="-fPIE -fstack-protector-all -D_FORTIFY_SOURCE=2"
LDFLAGS="-Wl,-z,now -Wl,-z,relro"
I saw about a 5-10% performance drop compared to optimized Gentoo Linux (incl. PaX/SELinux and other measures, not just CFLAGS) in the default Phoronix benchmark suite.
As for your final question:
And how useful are such measures anyway, when your application uses about 30 libraries that use none of those? ;)
PIE is only necessary for the main program to be able to be loaded at a random address. ASLR always works for shared libraries, so the benefit of PIE is the same whether you're using one shared library or 100.
Stack protector will only benefit the code that's compiled with stack protector, so using it just in your main program will not help if your libraries are full of vulnerabilities.
In any case, I would encourage you not to consider these options part of your application, but instead part of the whole system integration. If you're using 30+ libraries (probably most of which are junk when it comes to code quality and security) in a program that will be interfacing with untrusted, potentially-malicious data, it would be a good idea to build your whole system with stack protector and other security hardening options.
Do keep in mind, however, that the highest levels of _FORTIFY_SOURCE and perhaps some other new security options break valid things that legitimate, correct programs may need to do, and thus you may need to analyze whether it's safe to use them. One known-dangerous thing that one of the options does (I forget which) is making it so the %n specifier to printf does not work, at least in certain cases. If an application is using %n to get an offset into a generated string and needs to use that offset to later write in it, and the value isn't filled in, that's a potential vulnerability in itself...
The Hardening page on the Debian wiki explains at least the most common ones usable on Linux. Missing from your list are at least -D_FORTIFY_SOURCE=2, -Wformat, -Wformat-security, and, for the dynamic loader, the relro and now features.

Link-time optimization-enabled techniques and patterns?

Link-time optimization (LTO) is included in GCC 4.5 and later, and other compilers have similar optimization passes. Doesn't this make certain code patterns much more viable than before?
For example, for maximum performance a "module" of C code often needs to expose its guts. Does LTO make this obsolete? What code patterns are now viable that were not before?
I believe that LTO is simply an optimization, but not necessarily one that obviates the need to expose the implementation ("the guts") of a module. Whole languages have been written to that effect; I do not think C will have that need removed from it soon, or perhaps ever.
From the description of the LTO feature in gcc:
Link Time Optimization (LTO) gives GCC the capability of dumping its
internal representation (GIMPLE) to disk, so that all the different
compilation units that make up a single executable can be optimized as
a single module. This expands the scope of inter-procedural
optimizations to encompass the whole program (or, rather, everything
that is visible at link time).
From the announcement of LTO's inclusion into gcc:
The result should, in principle, execute faster but our IPA cost
models are still not tweaked for LTO. We've seen speedups as well as
slowdowns in benchmarks (see the LTO testers at
http://gcc.opensuse.org/).

Faster code with another compiler

I'm using the standard gcc compiler for math software development in C. I don't know much about compilers or compiler options, and I was just wondering: is it possible to make faster executables by using another compiler or by choosing better options? The default Makefile sets -ffast-math and -O3, and I think both of them have some impact on the overall calculation time. My software uses memory quite extensively, so I imagine some options related to memory management might do the trick?
Any ideas?
Before experimenting with different compilers or random, arbitrary micro-optimisations, you really need to get a decent profiler and profile your code to find out exactly what the performance bottlenecks are. The actual picture may be very different from what you imagine it to be. Once you have a profile you can then start to consider what might be useful optimisations. E.g. changing compiler won't help you if you are limited by memory bandwidth.
Here are some tips about gcc performance:
Do benchmarks with -Os, -O2 and -O3. Sometimes -O2 will be faster because it generates shorter code. Since you said that you use a lot of memory, try -Os too and take measurements.
Also check out the -march=native option (it is considered safe to use if you are building for computers with processors similar to the build machine). Sometimes it can have a considerable impact on performance. If you want to see which options gcc enables with native, here's how to do it:
Make a small C program called test.c, then
$ touch test.c
$ gcc -march=native -fverbose-asm -S test.c
$ cat test.s
Credit for the code goes to Gentoo forum users.
It should print out a list of all the optimizations gcc used. Please note that if you're using an i7, gcc 4.5 will detect it as Atom, so you'll need to set -march and -mtune manually.
Also read this document; it will help you (still, in my experience on Gentoo, -march=native works better): http://gcc.gnu.org/onlinedocs/gcc/i386-and-x86_002d64-Options.html
You could try the new options in late 4.4 and early 4.5 versions, such as -flto and -fwhole-program. These should help performance, but when I experimented with them my system was unstable. In any case, read this document too; it will help you understand some of GCC's optimization options: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
If you are running Linux on x86 then typically the Intel or PGI compilers will give you significantly faster performing executables.
The downsides are that there are more knobs to tune and that they come with a hefty price tag!
If you have specific hardware you can target your code for, the (hardware) company often releases paid-for compilers optimized for that hardware.
For example:
xlc for AIX
CC for Solaris
These compilers will generally produce better code optimization-wise.
Since you say your program is memory-heavy, you could try using a different malloc implementation than the one in your platform's standard library.
For example, you could try jemalloc (http://www.canonware.com/jemalloc/).
Keep in mind that most improvements to be had by changing compilers or settings will only get you proportional speedups, whereas by adjusting algorithms you can sometimes improve the big-O complexity of your program. Be sure to exhaust that before you put too much work into tweaking settings.
