Why is the gcc math library so inefficient? - c

When I was porting some Fortran code to C, I was surprised that most of the execution-time discrepancy between the Fortran program compiled with ifort (the Intel Fortran compiler) and the C program compiled with gcc comes from the evaluation of trigonometric functions (sin, cos). It surprised me because I used to believe what this answer explains: that functions like sine and cosine are implemented in microcode inside microprocessors.
To isolate the problem, I made a small test program in Fortran:
program ftest
  implicit none
  real(8) :: x
  integer :: i
  x = 0d0
  do i = 1, 10000000
    x = cos (2d0 * x)
  end do
  write (*,*) x
end program ftest
On an Intel Q6600 processor running 3.6.9-1-ARCH x86_64 Linux, with ifort version 12.1.0 I get
$ ifort -o ftest ftest.f90
$ time ./ftest
-0.211417093282753
real 0m0.280s
user 0m0.273s
sys 0m0.003s
while with gcc version 4.7.2 I get
$ gfortran -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m2.148s
user 0m2.090s
sys 0m0.003s
This is almost a factor of 10 difference! Can I still believe that the gcc implementation of cos is a wrapper around the microprocessor implementation, in a similar way as is probably done in the Intel implementation? If this is true, where is the bottleneck?
EDIT
According to the comments, enabling optimizations should improve the performance. My opinion was that optimizations do not affect the library functions ... which does not mean that I don't use them in nontrivial programs. However, here are two additional benchmarks (now on my home computer, an Intel Core 2):
$ gfortran -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m2.993s
user 0m2.986s
sys 0m0.000s
and
$ gfortran -Ofast -march=native -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m2.967s
user 0m2.960s
sys 0m0.003s
Which particular optimizations did you (commenters) have in mind? And how can the compiler exploit a multi-core processor in this particular example, where each iteration depends on the result of the previous one?
EDIT 2
The benchmark tests of Daniel Fisher and Ilmari Karonen made me think that the problem might be related to the particular version of gcc (4.7.2), and maybe to the particular build of it (Arch x86_64 Linux), that I am using on my computers. So I repeated the test on an Intel Core i7 box with Debian x86_64 Linux, gcc version 4.4.5 and ifort version 12.1.0:
$ gfortran -O3 -o ftest ftest.f90
$ time ./ftest
0.16184945593939115
real 0m0.272s
user 0m0.268s
sys 0m0.004s
and
$ ifort -O3 -o ftest ftest.f90
$ time ./ftest
-0.211417093282753
real 0m0.178s
user 0m0.176s
sys 0m0.004s
For me this is a perfectly acceptable performance difference, one that would never have made me ask this question. It seems that I will have to ask on the Arch Linux forums about this issue.
However, the explanation of the whole story is still very welcome.

Most of this is due to differences in the math library. Some points to consider:
Yes, x86 processors with the x87 unit have fsin and fcos instructions. However, they are implemented in microcode, and there is no particular reason why they must be faster than a pure software implementation.
GCC does not have its own math library, but rather uses the system-provided one. On Linux this is typically provided by glibc.
32-bit x86 glibc uses fsin/fcos.
x86_64 glibc uses software implementations using the SSE2 unit. For a long time, this was a lot slower than the 32-bit glibc version which just used the x87 instructions. However, improvements have (somewhat recently) been made, so depending on which glibc version you have the situation might not be as bad anymore as it used to be.
The Intel compiler suite is blessed with a VERY fast math library (libimf). Additionally, it includes vectorized transcendental math functions, which can often further speed up loops with these functions.

Related

Binary doesn't work and IR doesn't have any characteristic forking calls

I was trying to run openmp code with clang 10.0.0 using libgomp.
Here is what I did to run the code using libomp (I need to see the LLVM-IR too)
clang -Xclang -cc1 file.c -emit-llvm -S -fopenmp=libomp
clang -fopenmp=libomp file.ll
Here, the binary works as expected (in parallel) and the IR contains the characteristic __kmpc_fork_call function.
When I try the same using libgomp
clang -Xclang -cc1 file.c -emit-llvm -S -fopenmp=libgomp
clang -fopenmp=libgomp file.ll
In this case, the binary does not work as expected (only one thread runs), and the IR does not contain any characteristic forking calls. Am I doing something wrong here?
AFAIK, this feature is sadly not really supported so far. The last version of Clang (12.0) clearly does not generate GOMP function calls while it does with its own OpenMP runtime.
See this reported bug for more information. Developers say:
I don't think -fopenmp=libgomp is functional at all. [...] -fopenmp=libgomp is just like you compile without -fopenmp. [...] Clang uses only libomp interface and does not rely on libgomp.

how can there be such a big memory difference with -static compilation command?(C)

I am working on a task for university. There is a website that checks my memory usage, and it compiles the .c files with:
/usr/bin/gcc -DEVAL -std=c11 -O2 -pipe -static -s -o program programname.c -lm
and it says my program exceeds the memory limit of 4 MiB, which is a lot, I think. I was told this command makes it use more memory than the standard compilation I use on my PC, like this:
gcc myprog.c -o myprog
I launched the executable created by this one compilation with:
/usr/bin/time -v ./myprog
and under "maximum resident set size" it says 1708 kilobytes, which should be about 1.67 MiB. So how can it be that for the university checker my program goes over 4 MiB? I have eliminated all the mallocs I could and left just the essential ones, but it still says it goes over the limit. What else should I improve? I'm almost thinking the website has an error or something...
From GNU GCC Manual, Page 197:
-static On systems that support dynamic linking, this overrides ‘-pie’ and prevents linking with the shared libraries. On other systems, this option has no effect.
If you don't know about the pie flag quoted here, have a look at this section:
-pie Produce a dynamically linked position independent executable on targets that support it. For predictable results, you must also specify the same set of options used for compilation (‘-fpie’, ‘-fPIE’, or model suboptions) when you specify this linker option.
To answer your question: yes, it is possible that this overhead comes from the -static flag, because in that case the standard library's code is merged into your executable instead of being linked from a shared library.
As was suggested in the comments, you should compile your code with the same flags as the website to get an idea of the real overhead of your program (make sure that your gcc version is the same as the website's). You can also apply some common manual optimizations such as constant folding, function inlining, etc. A good reference for these optimizations could be this one

why my executable is bigger after linking

I have a small C program which I need to run on different chips.
The executable should be smaller than 32kb.
For this I have several toolchains with different compilers for arm, mips etc.
The program consists of several files which each is compiled to an object file and then linked together to an executable.
When I use the system gcc (x86) my executable is 15kb big.
With the arm toolchain the executable is 65kb big.
With another toolchain it is 47kb.
For example for arm all objects which are included in the executable are 14kb big together.
The objects are compiled with the following options:
-march=armv7-m -mtune=cortex-m3 -mthumb -msoft-float -Os
For linking the following options are used:
-s -specs=nosys.specs -march=armv7-m
The nosys.specs library is 274 bytes big.
Why is my executable still so much bigger (65kb) when my code is only 14kb and the library 274 bytes?
Update:
After suggestions from the answer I removed all malloc and printf commands from my code and removed the unused includes. Also I added the compile flags -ffunction-sections -fdata-sections and linking flag --gc-sections , but the executable is still too big.
For experimenting I created a dummy program:
int main()
{
    return 1;
}
When I compile the program with different compilers I get very different executable sizes:
8.3 KB : gcc -Os
22 KB : r2-gcc -Os
40 KB : arm-gcc --specs=nosys.specs -Os
1.1 KB : avr-gcc -Os
So why is my arm-gcc executable so much bigger?
The avr-gcc executable does static linking as well, I guess.
Your x86 executable is probably being dynamically linked, so any standard library functions you use -- malloc, printf, string and math functions, etc -- are not included in the binary.
The ARM executable is being statically linked, so those functions must be included in your binary. This is why it's larger. To make it smaller, you may want to consider compiling with -ffunction-sections -fdata-sections, then linking with --gc-sections to discard any unused functions or data from your binary.
(The "nosys.specs library" is not a library. It's a configuration file. The real library files are elsewhere.)
Porting embedded software depends on the target hardware and software platform.
Hardware platforms are split into MCUs and CPUs that can run a Linux OS.
The software platform includes the compiler toolchain and libraries.
It is meaningless to compare the program image size between an MCU and an x86 hardware platform.
But it is worth comparing the program image size on the same type of CPU using different toolchains.

Why assembly produced by objdump is huge?

I am trying to view the assembly for my simple C application. So, I have tried to produce assembly from binary by using objdump and it produces about 4.3MB sized file with 103228 lines of assembly code. Then, I have tried to do so by providing -S & -save-temps flags to the gcc.
I have used the following three commands:
1. arm-linux-gnueabi-objdump -d hello_simple > hello_simple.dump
2. arm-linux-gnueabi-gcc -save-temps -static hello_simple.c -o hello_simple -lm
3. arm-linux-gnueabi-gcc -S -static hello_simple.c -o hello_simple.asm -lm
In case of 2 & 3, exactly same results are produced, i.e., 65 lines of assembly code. I understand objdump produces some extra details too.
But, why is there a huge difference?
EDIT1: I have used the following command to build that binary:
arm-linux-gnueabi-gcc -static hello_simple.c -o hello_simple -lm
EDIT2: The -static and -lm flags may look unnecessary here, but I have to execute this binary on a simulator after compile-time additions of some assembly components, which makes them a must.
So, which assembly code should I consider as the most relevant during my analysis of execution traces? (I know it's another question but it would be handy to cover it in the same answer.)
The second two are just saving the asm for your functions.
The first one also has the CRT startup code. And, since you statically linked it, all the library functions you called.
Note that for 3, -static and -lm don't do anything, because you're not linking. gcc foo.c -S -O3 -fverbose-asm -o- | less is often handy.
I notice that none of your command lines included a -O3, or a -march=. You should compile with optimization on, and have gcc optimize your code for the target hardware.
.s is the standard suffix for machine-generated asm. (.S for hand-written asm: gcc foo.S will run it through cpp first). gcc -S produces a .s, the same way -c produces a .o.
For x86, .asm is usually only used for Intel-syntax (NASM/YASM), but IDK what the conventions are for ARM.
So, which assembly code should I consider as the most relevant during my analysis of execution traces?
It depends what you're trying to learn! If you have a good sense of how "expensive" each library function call is (in terms of number of instructions, number of branches polluting the branch-predictors, and data-cache pollution), then you don't need to trace execution through library calls. If you have math library functions that are used from some of your inner loops, then it's worth looking at them if the code is time-critical.
Usually a profiler or single-stepping in a debugger is useful for that, though. Just having disassembly output of a lot of library code is usually just clutter.

How to optimize the executable generated by a C compilation in Linux terminal

Is there any way to reduce the memory used by an executable generated with a command like gcc source_file.c -o result? I browsed the Internet and also looked in the man page for "gcc" and I think that I should use something related to -c or -S. So is gcc -c -S source_file.c -o result working? (This seems to reduce the space used...is there any other way to reduce even more?)
Thanks,
Polb
The standard compiler option on POSIX-like systems to instruct the compiler to optimize is -O (capital letter O for optimize). Many compilers allow you to optionally specify an optimization level after -O. Common optimization levels include:
-O0 no optimization at all
-O1 basic optimization for speed
-O2 all of -O1 plus some advanced optimizations
-O3 all of -O2 plus expensive optimizations that aren't usually needed
-Os optimize for size instead of speed (gcc, clang)
-Oz optimize even more for size (clang)
-Og all of -O1 except for optimizations that hinder debugging (gcc)
-Ofast all of -O3 and some numeric optimizations not in conformance with standard C. Use with caution. (gcc)
The -S option generates assembler output.
A better option is to use LLVM and generate assembler for multiple architectures with llc, similar to http://kripken.github.io/llvm.js/demo.html
Here is an example: https://idea.popcount.org/2013-07-24-ir-is-better-than-assembly/
