Eliminating redundant loads of GOT register?

I'm dealing with some code that's getting 70-80% slower when compiled as PIC (position independent code), and looking for ways to alleviate the problem. A big part of the problem is that gcc insists on inserting the following in every single function:
call __i686.get_pc_thunk.bx
addl $_GLOBAL_OFFSET_TABLE_,%ebx
even if that ends up being 20% of the content of the function. Now, ebx is a call-preserved register, every function in the relevant translation unit (source file) loads it with the address of the GOT, and it's easily detectable that the static functions cannot be called from outside the translation unit (their addresses are never taken).

So why can't gcc just load ebx once at the beginning of the big external-linkage functions, and generate the static-linkage functions so that they assume ebx has already been loaded with the address of the GOT? Is there any optimization flag I can use to force gcc to make this obvious and massive optimization, short of turning the inline limits up sky-high so that everything gets inlined into the external functions?

There is probably no generic cure for this, but you could try playing around with inlining options. I'd guess that static functions in a compilation unit don't have too many callers, so the overhead of code replication wouldn't be too bad.
The easiest way to force such things with gcc is to mark the functions with __attribute__((always_inline)). You could wrap it in a gcc-dependent macro to preserve portability.
If you don't want to modify the code (though static inline would be good anyhow), you could use the -finline-limit option to fine-tune that.

Not really a solution, but: if the functions in question do not reference file-scope variables, you could put them all together in a single translation unit and compile it without the -fPIC flag. Then you link it together with the other files into the final SO as usual.

Related

Why is optimizing inline functions easier than normal functions?

I'm reading What Every Programmer Should Know About Memory (https://people.freebsd.org/~lstewart/articles/cpumemory.pdf), and it says that inline functions make your code more optimizable.
for example:
Inlining of functions, in particular, allows the compiler to optimize larger chunks of code at a time which, in turn, enables the generation of machine code which better exploits the processor’s pipeline architecture
and:
The handling of both code and data (through dead code elimination or value range propagation, and others) works better when larger parts of the program can be considered as a single unit.
and this also:
If a function is only called once it might as well be inlined. This gives the compiler the opportunity to perform more optimizations (like value range propagation, which might significantly improve the code).
After reading these, it seems to me at least that inline functions are easier to optimize, but why? Why is it easier to optimize something that is inlined?
The reason that it is easier to do a better job when optimizing inlined functions than out-of-line ones is that you know the complete context in which the function is called and can use this information to tune the generated code to match that exact context. This often allows more efficient code not only for the inlined copy of the function but also for the calling function. The caller and the callee can be optimized to fit each other in a way that is not possible for out-of-line functions.
There is no difference!
All functions are subject to being inlined by gcc at the -O3 optimization level, whether declared inline, static, neither, or both.
see: https://stackoverflow.com/a/40783656/9925764
or here is a modified version of the example from @Eugene Sh., without the noinline option:
https://godbolt.org/z/arPEf7rd4

Functions splitting effect on running time

I am writing DSP code in C (Windows environment). The code will be modified by another engineer to run on a Cortex-M4. This engineer claims that, to reduce running time, many of the functions that I have implemented should be merged into one function. I would prefer to avoid that, to preserve clarity and testability.
Does his claim make sense? If it does, where can I read about it? Otherwise, can I show that he is wrong without comparing running times?
Does his claim make sense?
Depends on context. Modern compilers are perfectly able to inline function calls, but that usually means that those functions must be placed in the same translation unit (essentially the same .c file).
If your functions are in the same .c file, then their claim is wrong; if you have the functions scattered across multiple files, then their claim is likely correct.
If it does, where can I read about it?
Function inlining has been around for some 30 years. C even added an inline keyword for it in 1999 (C++ had one earlier still), though during the 2000s compilers became smarter than programmers at determining when and what to inline. Nowadays, when using modern compilers, inline is mostly considered obsolete.
Otherwise, can I show that he is wrong without comparing running times?
By disassembling the optimized code and seeing whether any function calls remain. Still, function calls are relatively cheap on Cortex-M (unless there is a ton of parameters to pass), so manually removing them would be a very tiny optimization.
As always there's a choice between code size and execution speed.
If you wish to remove the stack overhead of a function call but keep your code modular, consider using the inline function attribute suitable for your compiler, e.g.:
static inline void com_ClearMessageBuffer(uint8_t* pBuffer, uint32_t length)
{
    NRF_LOG_DEBUG("com_ClearMessageBuffer");
    memset(pBuffer, 0, length);
}
Then at compile time your inline function code will be inserted into the code flow wherever it is called.
This will speed up execution, but when the function is called multiple times it will increase the code size.

Tell C to inline function but still have it available to call debugger

Is there any way to mark a function as inline, but still have it available to call from the debugger? All of the functions I would like to call are marked static inline, because we are only allowed to expose certain functions in our file. I'm using gcc.
-ginline-points could help:
Generate extended debug information for inlined functions. Location view tracking markers are inserted at inlined entry points, so that address and view numbers can be computed and output in debug information. This can be enabled independently of location views, in which case the view numbers won’t be output, but it can only be enabled along with statement frontiers, and it is only enabled by default if location views are enabled.
Inlined functions have no return instruction, so even if you had the address of the start of an inlined copy, calling it from the debugger would run on into the code that follows the inlined body, and it would almost certainly not have a suitable stack frame.
It is not usual, and certainly not easy, to debug optimised code in any case. Normally one would simply switch optimisation off for debugging; in GCC at least, the inline keyword is ignored at -O0.
This is one of the issues with debugging optimized code. You need to lower the optimization level a little (the usual recommendation, in CMake for instance, is to use -O2 instead of -O3) and add -fno-omit-frame-pointer to the command line (this makes the code slower, as it dedicates a register to tracking the stack frame pointer across function calls).
On compilers like ICC, you can have even more debug info by using -debug all.

Is there, as in JavaScript, a performance penalty for creating functions in C?

In JavaScript, there are, often, huge performance penalties for writing functions. For example, if you use this function:
function double(x){ return x*2; }
inside an inner loop, you are probably hurting your performance considerably, so it is really profitable to inline that kind of function for intensive applications. Does this, in general, hold for C? Am I free to create those kinds of functions for everything, and rest assured the compiler will do the job, or is hand-inlining still important?
The answer is: it depends.
I'm currently using MSVC compiler and GCC for a project at work and my experience is that they both do a pretty good job. Furthermore, the cost of a function call in native code can be pretty small, especially in functions that do not need to be accessible outside the executable (like functions not exported in a shared library). For these functions, there is more flexibility with how the call is actually implemented.
A few things to note: it's much easier for a compiler to optimize calls to static functions. Functions with external linkage often require link time optimization since one must know how and where the function is actually called, as well as the implementation, to do much optimization or inlining. This requires examining more than one compilation unit at a time.
I would say that you should use functions where it makes sense and makes the code easier to read and maintain. In general, it is safe to assume that the cost is smaller than it would be in JavaScript. But in the end, you'd have to profile the code to say anything more precise.
UPDATE: I want to emphasize that functions can be inlined across compilation units, but this requires link-time optimization (or whole program optimization). This is supported in both GCC (https://gcc.gnu.org/wiki/LinkTimeOptimization) and MSVC (http://msdn.microsoft.com/en-us/library/0zza0de8.aspx).
These days, if you can beat the compiler by copying the body of a function and pasting it everywhere you call that function, you probably need a different compiler.
In general, with optimizations turned on, gcc will tend to inline short functions provided that they are defined in the same compilation unit that they are called in.
Moreover, if the calling function and the called function are in different compilation units, the compiler has no chance to inline them, regardless of what you request (unless link-time optimization is used).
So, if you want to maximize the chance of the compiler optimizing away a function call (without manually inlining it), you should define the function in a .h file or in the same .c file that it is called from.
There are no inner functions in C, period. So the rest of your question is somewhat irrelevant.
Anyway, as for "normal" functions in C, the compiler may or may not inline them (replace a function invocation with its body). If you compile your code with "optimize for size", it may decide not to inline, for obvious reasons.

CPU dependent code: how to avoid function pointers?

I have performance critical code written for multiple CPUs. I detect CPU at run-time and based on that I use appropriate function for the detected CPU. So, now I have to use function pointers and call functions using these function pointers:
void do_something_neon(void);
void do_something_armv6(void);

void (*do_something)(void);

if (cpu == NEON) {
    do_something = do_something_neon;
} else {
    do_something = do_something_armv6;
}

// Use the function pointer:
do_something();
...
Not that it matters much, but I'll mention that I have functions optimized for different CPUs: armv6, and armv7 with NEON support. The problem is that using function pointers in many places makes the code slower, and I'd like to avoid that.
Basically, at load time the linker resolves relocations and patches the code with function addresses. Is there a way to get better control over that behavior?
Personally, I'd propose two different ways to avoid function pointers: create two separate .so (or .dll) files for the CPU-dependent functions, place them in different folders, and based on the detected CPU add one of these folders to the search path (or LD_LIBRARY_PATH). Then load the main code, and the dynamic linker will pick up the required library from the search path. The other way is to compile two separate copies of the library :)
The drawback of the first method is that it forces me to have at least three shared objects (DLLs): two for the CPU-dependent functions and one for the main code that uses them. I need three because I have to be able to do CPU detection before loading the code that uses the CPU-dependent functions. The good part of the first method is that the app won't load multiple copies of the same code for multiple CPUs; it will load only the copy that will actually be used. The drawback of the second method is quite obvious, no need to talk about it.
I'd like to know if there is a way to do this without using shared objects and manually loading them at runtime. One way would be some hackery that involves patching code at run-time, but that's probably too complicated to get done properly. Is there a better way to control relocations at load time? Maybe place the CPU-dependent functions in different sections and then somehow specify which section has priority? I think the Mac's Mach-O format has something like that.
An ELF-only solution (for an ARM target) is enough for me; I don't really care about PE (DLLs).
thanks
You may want to look up the GNU dynamic linker extension STT_GNU_IFUNC. From Drepper's blog post announcing it:
Therefore I’ve designed an ELF extension which allows to make the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever a symbol lookup resolves to a symbol with this type the dynamic linker does not immediately return the found value. Instead it is interpreting the value as a function pointer to a function that takes no argument and returns the real function pointer to use. The code called can be under control of the implementer and can choose, based on whatever information the implementer wants to use, which of the two or more implementations to use.
Source: http://udrepper.livejournal.com/20948.html
Nonetheless, as others have said, I think you're mistaken about the performance impact of indirect calls. All code in shared libraries will be called via a (hidden) function pointer in the GOT and a PLT entry that loads/calls that function pointer.
For the best performance you need to minimize the number of indirect calls (through pointers) per second and allow the compiler to optimize your code better (DLLs hamper this because there must be a clear boundary between a DLL and the main executable and there's no optimization across this boundary).
I'd suggest doing these:
moving as much of the main executable's code that frequently calls DLL functions into the DLL. That'll minimize the number of indirect calls per second and allow for better optimization at compile time too.
moving almost all your code into separate CPU-specific DLLs and leaving main() only the job of loading the proper DLL, OR making CPU-specific executables without DLLs.
Here's the exact answer that I was looking for.
GCC's __attribute__((ifunc("resolver")))
It requires fairly recent binutils.
There's a good article that describes this extension: Gnu support for CPU dispatching - sort of...
Lazy loading ELF symbols from shared libraries is described in section 1.5.5 of Ulrich Drepper's DSO How To (updated 2011-12-10). For ARM it is described in section 3.1.3 of ELF for ARM.
EDIT: Regarding the STT_GNU_IFUNC extension mentioned by R.: I forgot that it was an extension. GNU Binutils has supported it for ARM since March 2011, according to the changelog.
If you want to call functions without the indirection of the PLT, I suggest function pointers, or per-arch shared libraries inside which function calls don't go through the PLT (beware: calling an exported function still goes through the PLT).
I wouldn't patch the code at runtime. I mean, you can. You can add a build step: after compilation disassemble your binaries, find all offsets of calls to functions that have multi-arch alternatives, build table of patch locations, link that into your code. In main, remap the text segment writeable, patch the offsets according to the table you prepared, map it back to read-only, flush the instruction cache, and proceed. I'm sure it will work. How much performance do you expect to gain by this approach? I think loading different shared libraries at runtime is easier. And function pointers are easier still.