I am using an ARM Cortex-A9 platform to measure the performance of some algorithms. More specifically, I measure the execution time of one algorithm using the clock() function from time.h, calling it right before invoking my algorithm and right after the algorithm returns.
....
....
....
start=clock();
alg();
end=clock();
...
...
...
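A minimal, self-contained sketch of this measurement pattern (alg() stands in for the real algorithm; the elapsed time is the tick difference divided by CLOCKS_PER_SEC):
#include <stdio.h>
#include <time.h>

void alg(void);  /* the algorithm under test */

int main(void)
{
    clock_t start = clock();
    alg();
    clock_t end = clock();
    /* clock() returns CPU time in ticks; convert to seconds */
    double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
    printf("%f sec\n", elapsed);
    return 0;
}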
Then I compile the code twice with exactly the same options and produce two different output files: the first is named n and the second nn. On the ARM platform I run my code pinned to one core, with the affinity of all other tasks set to the remaining cores. The executable n reports 0.12 sec while nn reports 0.1 sec. I compared the two binary files and found no differences. I noticed that whenever I give the output file a name longer than one letter, the measured execution time of my algorithm is always lower. Moreover, if I run the executable built from n.c, then rename it and run it again, I also get different performance numbers.
Could you please give me some ideas why something like this happens? Thanks in advance
P.S.1: I am using gcc 4.8.1 cross compiler.
P.S.2: I compile my code with
arm-none-linux-gnueabi-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=softfp -mfpu=neon -Ofast code.c -o n
Related
I have to complete a project for university where I need to be able to optimise using compiler optimisation levels.
I am using OpenMP and as a result I have got the gcc-11 compiler from brew.
I have watched this video https://www.youtube.com/watch?v=U161zVjv1rs and tried the same thing:
gcc-11 -fopenmp jacobi2d1.c -o1 out/jacobi2d1
But I am getting an error.
How do I do this?
Any advice would be appreciated
Optimization levels are specified with -O1 etc, using capital letter O, not lower-case letter o.
Lower-case -o1 specifies that the output file should be 1, and then out/jacobi2d1 is an input file to be linked, but it is an existing executable and you can't link one executable into another — hence the error from the linker.
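For example, keeping the file names from the question, the intended command would be along the lines of:
gcc-11 -fopenmp -O1 jacobi2d1.c -o out/jacobi2d1
where -O1 selects the optimisation level and -o out/jacobi2d1 names the output file.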
I have a Brainfuck interpreter project with two source files. Altering the order in which the source files are given as operands to Clang, and changing nothing else, results in consistent performance differences.
I am using Clang, with the following arguments:
clang -I../ext -D VERSION=\"1.0.0\" main.c lex.c
clang -I../ext -D VERSION=\"1.0.0\" lex.c main.c
Performance differences are seen regardless of optimisation level.
Benchmark results:
-O0 lex before main: 13.68s, main before lex: 13.02s
-O1 lex before main: 6.91s, main before lex: 6.65s
-O2 lex before main: 7.58s, main before lex: 7.50s
-O3 lex before main: 6.25s, main before lex: 7.40s
Which order performs worse is not always consistent between the optimisation levels, but for each level, the same operand order always performs worse than the other.
Notes:
Source code can be found here.
The mandelbrot benchmark that I am using with the interpreter can be found here.
Edits:
The executable files for each optimisation level are exactly the same size, but are structured differently.
Object files are identical with either operand order.
The I/O and parsing process is equally quick regardless of operand order; even running a 500 MiB random file through it results in no variation, so the performance difference must be occurring in the run loop.
Upon comparing the objdump of each executable, it appears to me that the primary, if not the only, difference is the order of the sections, and the memory addresses that have changed because of this.
The objdumps can be found here.
I don't have a complete answer, but I think I know what's causing the differences between the two link orders.
First, I got similar results. I'm using gcc on Cygwin. Some sample runs:
Building like this:
$ gcc -I../ext -D VERSION=\"1.0.0\" main.c lex.c -O3 -o mainlex
$ gcc -I../ext -D VERSION=\"1.0.0\" lex.c main.c -O3 -o lexmain
Then running (multiple times to confirm, but here's a sample run)
$ time ./mainlex.exe input.txt > /dev/null
real 0m7.377s
user 0m7.359s
sys 0m0.015s
$ time ./lexmain.exe input.txt > /dev/null
real 0m6.945s
user 0m6.921s
sys 0m0.000s
Then I noticed these declarations:
static char arr[30000] = { 0 }, *ptr = arr;
static tok_t **dat; static size_t cap, top;
And that caused me to recognize that a 30K zero-filled array is getting placed into the linked image of the program. That might introduce a page-load hit. The link order might influence whether the code in main lands within the same page as the functions in lex, or accessing the array might mean jumping to a page that is no longer in the cache, or some combination thereof. It was just a hypothesis, not a theory.
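One way to check where arr actually ends up (a sketch, assuming the GNU binutils that come with Cygwin) is to compare the symbol addresses and section layout of the two builds:
$ nm mainlex.exe | grep -i arr
$ objdump -h mainlex.exe
The first command shows the address arr was placed at, and the second lists the size and placement of each section, which makes it easy to see how the two link orders differ.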
So I moved the declarations of these globals directly into main and dropped the static qualifier, keeping the zero-initialization of the variables.
int main(int argc, char *argv[]) {
char arr[30000] = { 0 }, *ptr = arr;
tok_t **dat=NULL; size_t cap=0, top=0;
That will certainly shrink the object code and binary size by 30K and the stack allocation should be near instant.
I get near identical perf when I run it both ways. As a matter of fact, both builds run faster.
$ time ./mainlex.exe input.txt > /dev/null
real 0m6.385s
user 0m6.359s
sys 0m0.015s
$ time ./lexmain.exe input.txt > /dev/null
real 0m6.353s
user 0m6.343s
sys 0m0.015s
I'm not an expert on page sizes, code paging, or even how the linker and loader operate. But I do know that global variables, including that 30K array, are emitted into the object code directly (thus increasing the object file size itself) and are effectively part of the binary's final image. And smaller code is often faster code.
That 30K buffer in global space might be introducing enough bytes between the functions in lex, main, and the C runtime itself to affect the way code gets paged in and out, or it might just cause the loader to take longer to load the binary.
In other words, globals cause code bloat and increase object size. By moving the array declaration to the stack, the memory allocation is near instant, and the linked code of lex and main now probably fits within the same page in memory. Further, because the variables are on the stack, the compiler can likely take more liberty with optimizations.
In short, I think I found the root cause, but I'm not 100% sure as to why. There aren't a whole lot of function invocations being made, so it's not as if the instruction pointer is jumping around between code from lex.o and code from main.o so much that the cache keeps having to reload pages.
A better test might be to find a much larger input file that triggers a longer run. That way, we can see if the runtime delta is fixed or linear between the two original builds.
Any more insight will require doing some actual code profiling, instrumentation, or binary analysis.
I've been looking for a C profiler for Windows that will allow me to inspect time spent at the level of individual source-code lines, as opposed to just at the level of functions, in order to find hotspots in the program that can be optimized.
Very Sleepy looks great for this purpose. However, in the Source view, the amount of time attributed to each line of code doesn't seem to add up to 100% of the Exclusive time for the function.
For example, Very Sleepy says we spent 18.50s of Exclusive time in the function, but adding up all of the durations shown in the Source view for that function gives only about 10s.
This is how I compile the program:
gcc -IC:/msys64_new/mingw64/include *.c -o plane.exe -g -gdwarf-2 -fno-omit-frame-pointer -O2 -Wall -Wno-unused -LC:/msys64_new/mingw64/lib -lShlwapi
I then open Very Sleepy through the GUI and sample the running process for exactly 100 seconds.
I'm using Very Sleepy CS 0.90. I'm running Windows 7 and using the Mingw-w64 subsystem of MSYS2.
EDIT:
I've also noticed two additional weird things. First of all, Very Sleepy displays some functions without their names, but does recognize them as part of the profiled module.
Secondly, Very Sleepy seems to think a few variables are actually functions. For example:
extension_module_file_suffix is not a function, it's a variable. What's going on?
I am using perf to profile my program, which makes heavy use of exp() and pow(). The code was compiled with:
icc -g -fno-omit-frame-pointer test.c
and profiled with:
perf record -g ./a.out
which is followed by:
perf report -g 'graph,0.5,caller'
and perf showed that the two functions __libm_exp_l9() and __libm_pow_l9() are consuming a considerable amount of computational power.
So I am wondering: are they just aliases for exp() and pow(), respectively? Or do you have any suggestions on how to read the report here?
Thanks.
They are not aliases, but internal implementations of the functions. Mathematical libraries usually have several versions of each function, selected depending on the processor, instruction set, or arguments.
There is nothing to worry about. exp and pow are functions that are (usually) more complex than a single instruction, and therefore they take some time. Unfortunately I didn't find any reference documentation for them (the Intel math library is probably not open source), but it is common practice to use internal, namespaced names for such functions.
I will ask my question by giving an example. Now I have a function called do_something().
It has three versions: do_something(), do_something_sse3(), and do_something_sse4(). When my program runs, it will detect the CPU feature (see if it supports SSE3 or SSE4) and call one of the three versions accordingly.
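A rough sketch of that dispatch (the wrapper name call_do_something() is made up for illustration; __builtin_cpu_supports() needs GCC 4.8 or later):
void do_something(void);       /* plain C version      */
void do_something_sse3(void);  /* built with SSE3 code */
void do_something_sse4(void);  /* built with SSE4 code */

void call_do_something(void)
{
    if (__builtin_cpu_supports("sse4.1"))
        do_something_sse4();
    else if (__builtin_cpu_supports("sse3"))
        do_something_sse3();
    else
        do_something();
}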
The problem is: When I build my program with GCC, I have to set -msse4 for do_something_sse4() to compile (e.g. for the header file <smmintrin.h> to be included).
However, if I set -msse4, then gcc is allowed to use SSE4 instructions everywhere, and some intrinsics in do_something_sse3() are also translated into SSE4 instructions. So if my program runs on a CPU that supports only SSE3 (but not SSE4), it raises "illegal instruction" when do_something_sse3() is called.
Maybe I have some bad practice. Could you give some suggestions? Thanks.
I think that Mystical's tip is fine, but if you really want to do it in one file, you can use the appropriate pragmas, for instance:
#pragma GCC target("sse4.1")
GCC 4.4 is needed, AFAIR.
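A minimal sketch of how the pragma can be scoped to just one function (assuming GCC 4.9 or later, where <smmintrin.h> can be included without -msse4 on the command line; the function body is only an illustration):
#include <smmintrin.h>   /* SSE4.1 intrinsics */

#pragma GCC push_options
#pragma GCC target("sse4.1")
/* Element-wise product of two int arrays; _mm_mullo_epi32 is an SSE4.1 intrinsic. */
void do_something_sse4(int *dst, const int *a, const int *b, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_mullo_epi32(va, vb));
    }
    for (; i < n; i++)   /* scalar tail */
        dst[i] = a[i] * b[i];
}
#pragma GCC pop_options
Functions outside the push_options/pop_options region are still compiled with the original target, so do_something_sse3() cannot pick up SSE4 instructions.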
I think you want to build what's called a "CPU dispatcher". I got one working (as far as I know) for GCC but have not got it to work with Visual Studio.
cpu dispatcher for visual studio for AVX and SSE
I would check out Agner Fog's vectorclass and the file dispatch_example.cpp
http://www.agner.org/optimize/#vectorclass
g++ -O3 -msse2 -c dispatch_example.cpp -o d2.o
g++ -O3 -msse4.1 -c dispatch_example.cpp -o d5.o
g++ -O3 -mavx -c dispatch_example.cpp -o d8.o
g++ -O3 -msse2 instrset_detect.cpp d2.o d5.o d8.o
Here is an example of compiling a separate object file for each optimization setting:
http://notabs.org/lfsr/software/index.htm
But even this method fails when gcc link time optimization (-flto) is used. So how can a single executable be built with full optimization for different processors? The only solution I can find is to use include directives to make the C files behave as a single compilation unit so that -flto is not needed. Here is an example using that method:
http://notabs.org/blcutil/index.htm
If you are using GCC 4.9 or above on an i686 or x86_64 machine, then you are supposed to be able to use intrinsics regardless of your -march=XXX and -mXXX options. You could write your do_something() accordingly:
void do_something()
{
    unsigned char temp[18];
    if (HasSSSE3())      /* check the more capable ISA first; SSSE3 implies SSE2 */
    {
        const __m128i MASK = _mm_set_epi8(12,13,14,15, 8,9,10,11, 4,5,6,7, 0,1,2,3);
        _mm_storeu_si128((__m128i*)(temp),
            _mm_shuffle_epi8(_mm_loadu_si128((const __m128i*)(ptr)), MASK));
    }
    else if (HasSSE2())
    {
        const __m128i i = _mm_loadu_si128((const __m128i*)(ptr));
        ...
    }
    else
    {
        // Do the byte swap/endian reversal manually
        ...
    }
}
You have to supply HasSSE2(), HasSSSE3() and friends. Also see Intrinsics for CPUID like informations?.
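A possible sketch of those checks, assuming GCC 4.8 or later where __builtin_cpu_supports() is available (the names mirror the ones used above):
static int HasSSE2(void)  { return __builtin_cpu_supports("sse2"); }
static int HasSSSE3(void) { return __builtin_cpu_supports("ssse3"); }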
Also see GCC Issue 57202 - Please make the intrinsics headers like immintrin.h be usable without compiler flags. But I don't believe the feature works. I regularly encounter compile failures because GCC does not make intrinsics available.