I have this small testcode atfork_demo.c:
#include <stdio.h>
#include <pthread.h>
void hello_from_fork_prepare() {
printf("Hello from atfork prepare.\n");
fflush(stdout);
}
void register_hello_from_fork_prepare() {
pthread_atfork(&hello_from_fork_prepare, 0, 0);
}
Now, I compile it in two different ways:
gcc -shared -fPIC atfork_demo.c -o atfork_demo1.so
gcc -shared -fPIC atfork_demo.c -o atfork_demo2.so -lpthread
My demo main atfork_demo_main.c is this:
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, const char** argv) {
if(argc <= 1) {
printf("usage: ... lib.so\n");
return 1;
}
void* plib = dlopen("libpthread.so.0", RTLD_NOW|RTLD_GLOBAL);
if(!plib) {
printf("cannot load pthread, error %s\n", dlerror());
return 1;
}
void* lib = dlopen(argv[1], RTLD_LAZY);
if(!lib) {
printf("cannot load %s, error %s\n", argv[1], dlerror());
return 1;
}
void (*reg)();
reg = dlsym(lib, "register_hello_from_fork_prepare");
if(!reg) {
printf("did not find func, error %s\n", dlerror());
return 1;
}
reg();
fork();
}
Which I compile like this:
gcc atfork_demo_main.c -o atfork_demo_main.exec -ldl
Now, I have another small demo atfork_patch.c where I want to override pthread_atfork:
#include <stdio.h>
int pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void)) {
printf("Ignoring pthread_atfork call!\n");
fflush(stdout);
return 0;
}
Which I compile like this:
gcc -shared -O2 -fPIC atfork_patch.c -o atfork_patch.so
And then I set LD_PRELOAD=./atfork_patch.so, and do these two calls:
./atfork_demo_main.exec ./atfork_demo1.so
./atfork_demo_main.exec ./atfork_demo2.so
In the first case, the LD_PRELOAD-override of pthread_atfork worked and in the second, it did not. I get the output:
Ignoring pthread_atfork call!
Hello from atfork prepare.
So, now to the question(s):
Why did it not work in the second case?
How can I make it work also in the second case, i.e. also override it?
In my real use case, atfork_demo is some library which I cannot change. I also cannot change atfork_demo_main but I can make it load any other code. I would prefer if I can just do it with some change in atfork_patch.
You get some more debug output if you also use LD_DEBUG=all. Maybe interesting is this bit, for the second case:
841: symbol=__register_atfork; lookup in file=./atfork_demo_main.exec [0]
841: symbol=__register_atfork; lookup in file=./atfork_patch_extended.so [0]
841: symbol=__register_atfork; lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
841: symbol=__register_atfork; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
841: binding file ./atfork_demo2.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__register_atfork' [GLIBC_2.3.2]
So, it searches for the symbol __register_atfork. I added that to atfork_patch_extended.so but it doesn't find it and uses it from libc instead. How can I make it find and use my __register_atfork?
As a side note, my main goal is to ignore the atfork handlers when fork() is called, but that is the subject of a separate question, not this one. One solution to that, which seems to work, is to override fork() itself like this:
pid_t fork(void) {
return syscall(SYS_clone, SIGCHLD, 0);
}
Before answering this question, I would stress that this is a really bad idea for any production application.
If you are using a third party library that puts such constraints in place, then think about an alternative solution, such as forking early to maintain a "helper" process, with a pipe between you and it... then, when you need to call exec(), you can request that it does the work (fork(), exec()) on your behalf.
Patching or otherwise side-stepping the services of a system call such as pthread_atfork() is just asking for trouble (missed events, memory leaks, crashes, etc...).
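The helper-process idea could be sketched roughly as follows. This is only an illustration under assumptions, not a drop-in implementation: the names helper_t, helper_start, helper_run and helper_stop are hypothetical, the request format (a fixed-size buffer holding a program name) is invented for brevity, and error handling is minimal. The point is that the helper is forked before the third-party library has registered any atfork handlers, so its own fork() calls never run them.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CMD_MAX 256 /* fixed-size request buffer; an assumption for simplicity */

/* Hypothetical handle for the helper process. */
typedef struct {
    pid_t pid;
    int to_helper;   /* write end: requests go here */
    int from_helper; /* read end: exit statuses come back here */
} helper_t;

/* Runs inside the helper: read a program name, fork+exec it, report its status. */
static void helper_loop(int in, int out) {
    char cmd[CMD_MAX];
    while (read(in, cmd, sizeof cmd) == (ssize_t)sizeof cmd) {
        int status = -1;
        pid_t child;
        cmd[CMD_MAX - 1] = '\0';
        child = fork();
        if (child == 0) {
            execlp(cmd, cmd, (char *)NULL);
            _exit(127); /* exec failed */
        }
        if (child > 0 && waitpid(child, &status, 0) == child && WIFEXITED(status))
            status = WEXITSTATUS(status);
        else
            status = -1;
        if (write(out, &status, sizeof status) != (ssize_t)sizeof status)
            break;
    }
}

/* Call this early, before the library that registers atfork handlers is loaded. */
int helper_start(helper_t *h) {
    int req[2], rsp[2];
    if (pipe(req) < 0)
        return -1;
    if (pipe(rsp) < 0) {
        close(req[0]); close(req[1]);
        return -1;
    }
    h->pid = fork();
    if (h->pid < 0) {
        close(req[0]); close(req[1]); close(rsp[0]); close(rsp[1]);
        return -1;
    }
    if (h->pid == 0) { /* helper process */
        close(req[1]);
        close(rsp[0]);
        helper_loop(req[0], rsp[1]);
        _exit(0);
    }
    close(req[0]);
    close(rsp[1]);
    h->to_helper = req[1];
    h->from_helper = rsp[0];
    return 0;
}

/* Ask the helper to run `program`; returns its exit status, or -1 on error. */
int helper_run(helper_t *h, const char *program) {
    char cmd[CMD_MAX] = {0};
    int status = -1;
    strncpy(cmd, program, CMD_MAX - 1);
    if (write(h->to_helper, cmd, sizeof cmd) != (ssize_t)sizeof cmd)
        return -1;
    if (read(h->from_helper, &status, sizeof status) != (ssize_t)sizeof status)
        return -1;
    return status;
}

/* Closing the request pipe makes helper_loop see EOF and exit. */
void helper_stop(helper_t *h) {
    close(h->to_helper);
    close(h->from_helper);
    waitpid(h->pid, NULL, 0);
}
```

A real version would also need to pass arguments and environment, and decide what to do with the inherited pipe descriptors in the exec'd program.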
As @Sergio pointed out, pthread_atfork() is actually built into atfork_demo2.so, so you can't do anything to override it... However, examining the disassembly / source of pthread_atfork() gives you a decent hint about how to achieve what you're asking:
0000000000000830 <__pthread_atfork>:
830: 48 8d 05 f9 07 20 00 lea 0x2007f9(%rip),%rax # 201030 <__dso_handle>
837: 48 85 c0 test %rax,%rax
83a: 74 0c je 848 <__pthread_atfork+0x18>
83c: 48 8b 08 mov (%rax),%rcx
83f: e9 6c fe ff ff jmpq 6b0 <__register_atfork@plt>
844: 0f 1f 40 00 nopl 0x0(%rax)
848: 31 c9 xor %ecx,%ecx
84a: e9 61 fe ff ff jmpq 6b0 <__register_atfork@plt>
or the source (from here):
int
pthread_atfork (void (*prepare) (void),
void (*parent) (void),
void (*child) (void))
{
return __register_atfork (prepare, parent, child, &__dso_handle == NULL ? NULL : __dso_handle);
}
As you can see, pthread_atfork() does nothing aside from calling __register_atfork()... so patch that instead!
The content of atfork_patch.c now becomes: (using __register_atfork()'s prototype, from here / here)
#include <stdio.h>
int __register_atfork (void (*prepare) (void), void (*parent) (void),
void (*child) (void), void *dso_handle) {
printf("Ignoring pthread_atfork call!\n");
fflush(stdout);
return 0;
}
This works for both demos:
$ LD_PRELOAD=./atfork_patch.so ./atfork_demo_main.exec ./atfork_demo1.so
Ignoring pthread_atfork call!
$ LD_PRELOAD=./atfork_patch.so ./atfork_demo_main.exec ./atfork_demo2.so
Ignoring pthread_atfork call!
It doesn't work in the second case because there is nothing to override. Your second library has pthread_atfork statically linked in from the pthread library:
$ readelf --symbols atfork_demo1.so | grep pthread_atfork
7: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND pthread_atfork
54: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND pthread_atfork
$ readelf --symbols atfork_demo2.so | grep pthread_atfork
41: 0000000000000000 0 FILE LOCAL DEFAULT ABS pthread_atfork.c
47: 0000000000000830 31 FUNC LOCAL DEFAULT 12 __pthread_atfork
49: 0000000000000830 31 FUNC LOCAL DEFAULT 12 pthread_atfork
So it will use its local pthread_atfork every time, regardless of LD_PRELOAD or any other loaded libraries.
How to overcome that? For the configuration described it does not look possible, since you would need to modify the atfork_demo library or the main executable anyway.
I did an experiment to see what kind of assembly language would be generated if I tried to get the same function compiled in twice. I did the following:
I created two simple test files and their corresponding headers. Let's call them a.c/a.h, and b.c/b.h. Here are the contents of those files:
a.h:
#ifndef __A_H__
#define __A_H__
int a( void );
#endif
b.h:
#ifndef __B_H__
#define __B_H__
int b( void );
#endif
a.c:
#include "a.h"
int a( void )
{
return 1;
}
b.c:
#include "b.h"
#include "a.h"
int b( void )
{
return 1 + a();
}
I then created a static archive for a:
gcc -c a.c -o a.o
ar -rsc a.a a.o
and the same for b, including the static archive for a this time:
gcc -c b.c -o b.o
ar -rsc b.a a.a b.o
At this point, I disassemble the static archive for b to verify that it has assembly code for both functions a() and b(). It does.
Now, I define one last file:
main.c:
#include <stdio.h>
#include "a.h"
#include "b.h"
int main( void )
{
printf( "%d %d\n", a(), b() );
return 0;
}
and I compile it thusly:
gcc main.c a.a b.a -o main
This works fine. When I disassemble it, I see the following definitions for a and b in the code:
0000000000400561 <a>:
400561: 55 push %rbp
400562: 48 89 e5 mov %rsp,%rbp
400565: b8 01 00 00 00 mov $0x1,%eax
40056a: 5d pop %rbp
40056b: c3 retq

000000000040056c <b>:
40056c: 55 push %rbp
40056d: 48 89 e5 mov %rsp,%rbp
400570: e8 ec ff ff ff callq 400561 <a>
400575: 83 c0 01 add $0x1,%eax
400578: 5d pop %rbp
400579: c3 retq
40057a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
As you can see, the code has clearly defined b as calling a rather than inlining it, however, there is only one definition of a in the code, no duplicates.
It seems that gcc has either:
Detected the duplicate object code and removed the duplicates
--or--
the b archive was used first, and it included the reference to int a(), so the a archive was ignored.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers? Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive? Are there less obvious situations than just duplicate symbol names where issues can arise when doing this?
Asking this because I've been playing with this idea for a project I'm on, and want to make the right choices.
My question is: is this behavior circumstantial to my test or is it standard, and can I expect the same behavior from other compilers?
As far as the compiler itself is concerned, there is no issue: you have one definition for each function among your sources.
As far as ar is concerned, you also have no issue: neither of the archives you built contains any duplicate symbols.
Different linkers may exhibit different behaviors, however. It is conceivable that some would reject linking archives that contain duplicate external symbols. Typical UNIX linkers will handle the situation you present, but they may vary in some details, such as whether a duplicate copy of function a() is included in the binary.
Obviously duplicate code is one problem, however there could be duplicate global references as well. Is it safe/good practice to build a large application that has multiple dependency paths to the same static archive?
"Multiple paths to the same static archive" does not seem to be a good characterization of the situation you present. In neither case do you provide the same archive more than once. Rather, in the b case you provide different archives with duplicate members. Linkers generally do not have problems with specifying the same archive multiple times in the same link command. Under some circumstances it may even be necessary to do so; it should not present a problem.
Providing distinct archives with duplicate members probably will not present a problem, except possibly for bloating your code with duplicate function implementations. This is a bit less certain, but I doubt it would present a problem in practice.
Whether that's good practice is a matter of opinion, but I'm inclined to think not. It's also not clear to me what gain you see in such an approach. On the other hand, I won't be sharpening any stakes or preparing any kindling if you decide to go ahead anyway.
I have a homework assignment that requires us to open, read and write to file using system calls rather than standard libraries. To debug it, I want to use std libraries when test-compiling the project. I did this:
#ifdef HOME
//Home debug printf function
#include <stdio.h>
#else
//Dummy printf function
int printf(const char* ff, ...) {
return 0;
}
#endif
And I compile it like this: gcc -DHOME -m32 -static -O2 -o main.exe main.c
The problem is that with the -nostdlib argument the standard entry point is void _start, but without that argument the entry point is int main(const char** args). You'd probably do this:
//Normal entry point
int main(const char** args) {
_start();
}
//-nostdlib entry point
void _start() {
//actual code
}
In that case, this is what you get when you compile without -nostdlib:
/tmp/ccZmQ4cB.o: In function `_start':
main.c:(.text+0x20): multiple definition of `_start'
/usr/lib/gcc/i486-linux-gnu/4.7/../../../i386-linux-gnu/crt1.o:(.text+0x0): first defined here
Therefore I need to detect whether stdlib is included and do not define _start in that case.
The low-level entry point is always _start for your system. With -nostdlib, its definition is omitted from linking so you have to provide one. Without -nostdlib, you must not attempt to define it; even if this didn't get a link error from duplicate definition, it would horribly break the startup of the standard library runtime.
Instead, try doing it the other way around:
int main() {
/* your code here */
}
#ifdef NOSTDLIB_BUILD /* you need to define this with -D */
void _start() {
main();
}
#endif
You could optionally add fake arguments to main. It's impossible to get the real ones from a _start written in C though. You'd need to write _start in asm for that.
Note that -nostdlib is a linker option, not a compile-time one, so there's no way to automatically determine at compile time that -nostdlib is going to be used. Instead, just make your own macro and pass it on the command line as -DNOSTDLIB_BUILD or similar.
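The same "pass your own macro" approach could also replace the dummy printf from the question. The following is a minimal sketch, assuming the HOME macro from the question is what marks a debug build; DEBUG_PRINTF and checked_add are hypothetical names used only for illustration. With -DHOME the macro forwards to printf; without it, the call disappears entirely, so nothing from the standard library is referenced.

```c
#ifdef HOME
#include <stdio.h>
/* Debug build: forward everything to the real printf. */
#define DEBUG_PRINTF(...) printf(__VA_ARGS__)
#else
/* Submission build: the call compiles away, no libc symbol is needed. */
#define DEBUG_PRINTF(...) ((void)0)
#endif

/* Hypothetical function that traces its inputs only in debug builds. */
int checked_add(int a, int b) {
    DEBUG_PRINTF("adding %d and %d\n", a, b);
    return a + b;
}
```

Unlike a hand-written dummy printf, this never defines a symbol that could clash with the real one.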
Here is a minimal example for an "executable" shared library (assumed file name: mini.c):
// Interpreter path is different on some systems
//+definitely different for 32-Bit machines
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#include <stdio.h>
#include <stdlib.h>
int entry() {
printf("WooFoo!\n");
exit (0);
}
If one compiles it with e.g.: gcc -fPIC -o mini.so -shared -Wl,-e,entry mini.c. "Running" the resulting .so will look like this:
confus@confusion:~$ ./mini.so
WooFoo!
My question is now:
How do I have to change the above program to pass command line arguments to a call of the .so-file? An example shell session after the change might e.g. look like this:
confus@confusion:~$ ./mini.so 2 bar
1: WooFoo! bar!
2: WooFoo! bar!
confus@confusion:~$ ./mini.so 3 bla
1: WooFoo! bla!
2: WooFoo! bla!
3: WooFoo! bla!
5: WooFoo! Bar!
It would also be nice to detect at compile time whether the target is a 32-bit or 64-bit binary, to change the interpreter string accordingly. Otherwise one gets an "Accessing a corrupted shared library" warning. Something like:
#ifdef SIXTY_FOUR_BIT
const char my_interp[] __attribute__((section(".interp"))) = "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#else
const char my_interp[] __attribute__((section(".interp"))) = "/lib/ld-linux.so.2";
#endif
Or even better, to detect the appropriate path fully automatically to ensure it is right for the system the library is compiled on.
How do I have to change the above program to pass command line arguments to a call of the .so-file?
When you run your shared library, argc and argv will be passed to your entry function on the stack.
The problem is that the calling convention used when you compile your shared library on x86_64 linux is going to be that of the System V AMD64 ABI, which doesn't take arguments on the stack but in registers.
You'll need some ASM glue code that fetches the arguments from the stack and puts them into the right registers.
Here's a simple .asm file you can save as entry.asm and just link with:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 64
_entry:
mov rdi, [rsp]
mov rsi, rsp
add rsi, 8
call .getGOT
.getGOT:
pop rbx
add rbx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
jmp entry wrt ..plt
That code copies the arguments from the stack into the appropriate registers, and then calls your entry function in a position-independent way.
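To make the stack layout the glue code relies on more concrete: at process entry on Linux, the word at the stack pointer holds argc, followed by the argv pointers, a NULL terminator, and then the environment pointers. The sketch below expresses that in C under the assumption that you are handed the initial stack pointer value; struct start_args and parse_initial_stack are hypothetical names, and in a real entry point you would still need assembly to obtain that pointer.

```c
#include <stdint.h>

/* Hypothetical view of what the kernel leaves on the stack at entry. */
struct start_args {
    int argc;
    char **argv;
    char **envp;
};

/* sp is the stack pointer value at the entry point (rsp or esp at _entry). */
struct start_args parse_initial_stack(uintptr_t *sp) {
    struct start_args a;
    a.argc = (int)sp[0];           /* first word: argument count */
    a.argv = (char **)(sp + 1);    /* next words: argv[0] .. argv[argc-1], NULL */
    a.envp = a.argv + a.argc + 1;  /* environment starts after argv's NULL */
    return a;
}
```

This mirrors what the assembly does with mov rdi, [rsp] (argc) and rsp + 8 (argv).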
You can then just write your entry as if it was a regular main function:
// Interpreter path is different on some systems
//+definitely different for 32-Bit machines
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#include <stdio.h>
#include <stdlib.h>
int entry(int argc, char* argv[]) {
printf("WooFoo! Got %d args!\n", argc);
exit (0);
}
And this is how you would then compile your library:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o
The advantage is that you won't have inline asm statements mixed with your C code, instead your real entry point is cleanly abstracted away in a start file.
It would also be nice to detect at compile time whether the target is a 32-bit or 64-bit binary, to change the interpreter string accordingly.
Unfortunately, there's no completely clean, reliable way to do that. The best you can do is rely on your preferred compiler having the right defines.
Since you use GCC you can write your C code like this:
#if defined(__x86_64__)
const char my_interp[] __attribute__((section(".interp")))
= "/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2";
#elif defined(__i386__)
const char my_interp[] __attribute__((section(".interp")))
= "/lib/ld-linux.so.2";
#else
#error Architecture or compiler not supported
#endif
#include <stdio.h>
#include <stdlib.h>
int entry(int argc, char* argv[]) {
printf("%d: WooFoo!\n", argc);
exit (0);
}
And have two different start files.
One for 64bit:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 64
_entry:
mov rdi, [rsp]
mov rsi, rsp
add rsi, 8
call .getGOT
.getGOT:
pop rbx
add rbx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
jmp entry wrt ..plt
And one for 32bit:
global _entry
extern entry, _GLOBAL_OFFSET_TABLE_
section .text
BITS 32
_entry:
mov edi, [esp]
mov esi, esp
add esi, 4
call .getGOT
.getGOT:
pop ebx
add ebx,_GLOBAL_OFFSET_TABLE_+$$-.getGOT wrt ..gotpc
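push esi ; argv: second argument, pushed first under cdecl
push edi ; argc: first argument
push dword 0 ; fake return address, since entry is reached by jmp and never returns
jmp entry wrt ..plt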
Which means you now have two slightly different ways to compile your library for each target.
For 64bit:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o -m64
And for 32bit:
nasm entry32.asm -f elf32
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry32.o -m32
So to sum it up you now have two start files entry.asm and entry32.asm, a set of defines in your mini.c that picks the right interpreter automatically, and two slightly different ways of compiling your library depending on the target.
So if we really want to go all the way, all that's left is to create a Makefile that detects the right target and builds your library accordingly. Let's do just that:
ARCH := $(shell getconf LONG_BIT)
all: build_$(ARCH)
build_32:
nasm entry32.asm -f elf32
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry32.o -m32
build_64:
nasm entry.asm -f elf64
gcc -fPIC -o mini.so -shared -Wl,-e,_entry mini.c entry.o -m64
And we're done here. Just run make to build your library and let the magic happen.
Add
int argc;
char **argv;
asm("mov 8(%%rbp), %0" : "=&r" (argc));
asm("mov %%rbp, %0\n"
"add $16, %0" : "=&r" (argv));
to the top of your entry function. On x86_64 platforms, this will give you access to the arguments.
The LWN article that John Bollinger linked to in the comments explains why this code works. It might interest you why this is not required when you write a normal C program, or rather, why it does not suffice to just give your entry function the two usual int argc, char **argv arguments: the entry point for a C program normally is not the main function, but instead an assembler function provided by glibc that does some preparations for you (among others, fetching the arguments from the stack) and that eventually (via some intermediate functions) calls your main function. Note that this also means that you might experience other problems, since you skip this initialization! For some history, the cdecl Wikipedia page, especially on the difference between x86 and x86_64, might be of further interest.
Let's say that I have a function that gets called in multiple parts of a program. Let's also say that I have a particular call to that function that is in an extremely performance-sensitive section of code (e.g., a loop that iterates tens of millions of times and where each microsecond counts). Is there a way that I can force the compiler (gcc in my case) to inline that single, particular function call, without inlining the others?
EDIT: Let me make this completely clear: this question is NOT about forcing gcc (or any other compiler) to inline all calls to a function; rather, it is about requesting that the compiler inline a particular call to a function.
In C before C99 (as opposed to C++) there was no standard way to suggest that a function should be inlined, only vendor-specific extensions; C99's inline keyword is itself only a hint.
However you specify it, as far as I know the compiler will always try to inline every instance, so use that function only once:
original:
int MyFunc() { /* do stuff */ }
change to:
inline int MyFunc_inlined() { /* do stuff */ }
int MyFunc() { return MyFunc_inlined(); }
Now, in the places where you want it inlined, use MyFunc_inlined().
Note: "inline" keyword in the above is just a placeholder for whatever syntax gcc uses to force an inlining. If H2CO3's deleted answer is to be trusted, that would be:
static inline __attribute__((always_inline)) int MyFunc_inlined() { /* do stuff */ }
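Filled in with a concrete body, the wrapper pattern above might look like the sketch below. It assumes GCC or Clang (always_inline is a compiler extension, not standard C), and the names scale_inlined, scale and sum_scaled are made up for the example.

```c
/* The body lives here; the GCC/Clang attribute asks for inlining at every call site. */
static inline __attribute__((always_inline)) int scale_inlined(int x) {
    return 3 * x + 1;
}

/* Ordinary out-of-line entry point for callers that do not need inlining. */
int scale(int x) {
    return scale_inlined(x);
}

/* Performance-sensitive loop: calls the _inlined variant directly. */
long sum_scaled(int n) {
    long total = 0;
    for (int i = 0; i < n; i++)
        total += scale_inlined(i);
    return total;
}
```

Everywhere else in the program you would keep calling scale(), so only the hot loop pays the code-size cost of inlining.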
It is possible to enable inlining per translation unit (but not per call). Though this is not an answer for the question and is an ugly trick, it conforms to C standard and may be interesting as related stuff.
The trick is to use extern definition where you do not want to inline, and extern inline where you need inlining.
Example:
$ cat func.h
int func();
$ cat func.c
int func() { return 10; }
$ cat func_inline.h
extern inline int func() { return 5; }
$ cat main.c
#include <stdio.h>
#ifdef USE_INLINE
# include "func_inline.h"
#else
# include "func.h"
#endif
int main() { printf("%d\n", func()); return 0; }
$ gcc main.c func.c && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE -O2 && ./a.out
5 // inlined!
You can also use a non-standard attribute (e.g. __attribute__((always_inline)) in GCC) for the extern inline definition, instead of relying on -O2.
BTW, the trick is used in glibc.
The traditional way to force-inline a function in C was to not use a function at all, but a function-like macro. This method always inlines the code, but function-like macros have some problems of their own. For example:
#define ADD(x, y) ((x) + (y))
printf("%d\n", ADD(2, 2));
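One of the better-known problems with function-like macros is that an argument can be evaluated more than once. The sketch below illustrates this with hypothetical names (MAX_MACRO, max_inline, next_value): the macro calls next_value() twice, while the equivalent inline function evaluates its argument only once.

```c
/* Function-like macro: each argument is pasted in wherever it appears. */
#define MAX_MACRO(a, b) ((a) > (b) ? (a) : (b))

/* Equivalent inline function: arguments are evaluated once, before the call. */
static inline int max_inline(int a, int b) {
    return a > b ? a : b;
}

static int calls = 0;

/* Hypothetical function with a side effect we can count. */
static int next_value(void) {
    calls++;
    return 10;
}

/* Returns how many times next_value() ran when passed to the macro. */
int count_calls_macro(void) {
    calls = 0;
    (void)MAX_MACRO(next_value(), 5);
    return calls;
}

/* Returns how many times next_value() ran when passed to the function. */
int count_calls_inline(void) {
    calls = 0;
    (void)max_inline(next_value(), 5);
    return calls;
}
```

With arguments such as i++ the macro version gets worse still, since the double evaluation changes the result.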
There is also the inline keyword, which was added to C in the C99 standard. Notably, Microsoft's Visual C compiler doesn't support C99, and thus you can't use inline with that (miserable) compiler. Inline only hints to the compiler that you want the function inlined - it does not guarantee it.
GCC has an extension which requires the compiler to inline the function.
inline __attribute__((always_inline)) int add(int x, int y) {
return x + y;
}
To make this cleaner, you may want to use a macro:
#define ALWAYS_INLINE inline __attribute__((always_inline))
ALWAYS_INLINE int add(int x, int y) {
return x + y;
}
I don't know of a direct way of having a function that can be force inlined on certain calls. But you can combine the techniques like this:
#define ALWAYS_INLINE inline __attribute__((always_inline))
#define ADD(x, y) ((x) + (y))
ALWAYS_INLINE int always_inline_add(int x, int y) {
return ADD(x, y);
}
int normal_add(int x, int y) {
return ADD(x, y);
}
Or, you could just have this:
#include <stdio.h>
#define ADD(x, y) ((x) + (y))
int add(int x, int y) {
return ADD(x, y);
}
int main() {
printf("%d\n", ADD(2,2)); // always inline
printf("%d\n", add(2,2)); // normal function call
return 0;
}
Also, note that forcing the inline of a function might not make your code faster. Inline functions cause larger code to be generated, which might cause more cache misses to occur.
I hope that helps.
The answer is it depends on your function, what you request and the nature of your function. Your best bet is to:
tell the compiler you want it inlined
make the function static (be careful with extern as its semantics change a little in gcc in some modes)
set the compiler options to inform the optimizer you want inlining, and set inline limits appropriately
turn on any couldn't inline warnings on the compiler
verify the output (you could check the generated assembler) to confirm that the function is inlined.
Compiler hints
The answers here cover just one side of inlining, the language hints to the compiler. When the standard says:
Making a function an inline function suggests that calls to the function be as
fast as possible. The extent to which such suggestions are effective is
implementation-defined
This can be the case for other stronger hints such as:
GNU's __attribute__((always_inline)): Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
Microsoft's __forceinline: The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).
Even both of these would rely on the inlining being possible, and crucially on compiler flags. To work with inlined functions you also need to understand the optimisation settings of your compiler.
It may be worth saying inlining can also be used to provide replacements for existing functions just for the compilation unit you are in. This can be used when approximate answers are good enough for your algorithm, or a result can be achieved in a faster way with local data structures.
An inline definition
provides an alternative to an external definition, which a translator may use to implement
any call to the function in the same translation unit. It is unspecified whether a call to the
function uses the inline definition or the external definition.
Some functions cannot be inlined
For example, for the GNU compiler functions that cannot be inlined are:
Note that certain usages in a function definition can make it unsuitable for inline substitution. Among these usages are: variadic functions, use of alloca, use of variable-length data types (see Variable Length), use of computed goto (see Labels as Values), use of nonlocal goto, and nested functions (see Nested Functions). Using -Winline warns when a function marked inline could not be substituted, and gives the reason for the failure.
So even always_inline may not do what you expect.
Compiler Options
Using C99's inline hints relies on you telling the compiler what inlining behaviour you are looking for.
GCC for instance has:
-fno-inline, -finline-small-functions, -findirect-inlining, -finline-functions, -finline-functions-called-once, -fearly-inlining, -finline-limit=n
Microsoft compiler also has options that dictate the effectiveness of inline. Some compilers will also allow optimization to take into account running profile.
I do think it's worth seeing inlining in the broader context of program optimization.
Preventing Inlining
You mention that you don't want certain functions inlined. This might be done by setting something like __attribute__((always_inline)) without turning on the optimizer. However, you would probably want the optimizer. One option here would be to hint that you don't want it: __attribute__((noinline)). But why would this be the case?
Other forms of optimization
You may also consider how you might restructure your loop and avoiding branches. Branch prediction can have a dramatic effect. For an interesting discussion on this see: Why is it faster to process a sorted array than an unsorted array?
You might also want smaller inner loops to be unrolled, and look for loop invariants that can be hoisted out of the loop.
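As a small illustration of the loop-invariant point, here is a sketch with hypothetical function names. The first version re-evaluates strlen(s) in the loop condition unless the compiler can prove it does not change; the second hoists it by hand. Whether this makes a measurable difference depends on the compiler and optimization level, so treat it as something to measure, not a guarantee.

```c
#include <stddef.h>
#include <string.h>

/* Naive: the loop condition calls strlen(s) on every iteration
 * unless the optimizer manages to hoist it. */
size_t count_char_naive(const char *s, char c) {
    size_t n = 0;
    for (size_t i = 0; i < strlen(s); i++)
        if (s[i] == c)
            n++;
    return n;
}

/* Hoisted: the invariant length is computed once, before the loop. */
size_t count_char_hoisted(const char *s, char c) {
    size_t n = 0;
    const size_t len = strlen(s);
    for (size_t i = 0; i < len; i++)
        if (s[i] == c)
            n++;
    return n;
}
```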
There's a kernel source that uses #defines in a very interesting way to define several different named functions with the same body. This solves the problem of having two different functions to maintain. (I forgot which one it was...). My idea is based on this same principle.
The way to use the defines is that you'll define the inline function on the compilation unit you need it. To demonstrate the method I'll use a simple function:
int add(int a, int b);
It works like this: you make a function generator #define in a header file and declare the function prototype of the normal version of the function (the one not inlined).
Then you declare two separate function generators, one for the normal function and one for the inline function. The inline function you declare as static __inline__. When you need to call the inline function in one of your files, you use the generator define to get the source for it. In all other files you need to use the normal function, you just include the header with the prototype.
The code was tested on:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Kernel Version: 3.16.0-49-generic
GCC 4.8.4
Code is worth more than a thousand words, so:
File Hierarchy
+
| Makefile
| add.h
| add.c
| loop.c
| loop2.c
| loop3.c
| loops.h
| main.c
add.h
#define GENERATE_ADD(type, prefix) \
type int prefix##add(int a, int b) { return a + b; }
#define DEFINE_ADD() GENERATE_ADD(,)
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__, inline_)
int add(int, int);
This doesn't look nice, but cuts the work of maintaining two different functions. The function is fully defined within the GENERATE_ADD(type,prefix) macro, so if you ever need to change the function, you change this macro and everything else changes.
Next, DEFINE_ADD() will be called from add.c to generate the normal version of add. DEFINE_INLINE_ADD() will give you access to a function called inline_add, which has the same signature as your normal add function, but it has a different name (the inline_ prefix).
Note: I didn't need __attribute__((always_inline)) when using the -O3 flag; __inline__ did the job. However, if you don't want to use -O3, use:
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__ __attribute__((always_inline)), inline_)
add.c
#include "add.h"
DEFINE_ADD()
A simple call to the DEFINE_ADD() macro generator. This will define the normal version of the function (the one that won't get inlined).
loop.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
Here in loop.c you can see the call to DEFINE_INLINE_ADD(). This gives this file access to the inline_add function. When you compile, all calls to inline_add will be inlined.
loop2.c
#include <stdio.h>
#include "add.h"
int loop2(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", add(i + 1, i + 2));
return 0;
}
This is to show you can use the normal version of add normally from other files.
loop3.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop3(void)
{
register int i;
printf ("add: %d\n", add(2,3));
printf ("add: %d\n", add(4,5));
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
This is to show that you can use both functions in the same compilation unit, yet one of them will be inlined and the other won't (see the GDB disassembly below for details).
loops.h
/* prototypes for main */
int loop (void);
int loop2 (void);
int loop3 (void);
main.c
#include <stdio.h>
#include <stdlib.h>
#include "add.h"
#include "loops.h"
int main(void)
{
printf("%d\n", add(1,2));
printf("%d\n", add(2,3));
loop();
loop2();
loop3();
return 0;
}
Makefile
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $@ $^ ${CFLAGS}
add.o: add.c
${CC} -c $^ ${CFLAGS}
loop.o: loop.c
${CC} -c $^ -O3 ${CFLAGS}
loop2.o: loop2.c
${CC} -c $^ ${CFLAGS}
loop3.o: loop3.c
${CC} -c $^ -O3 ${CFLAGS}
If you use the __attribute__((always_inline)) you can change the Makefile to:
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $@ $^ ${CFLAGS}
%.o: %.c
${CC} -c $^ ${CFLAGS}
Compilation
$ make
gcc -c add.c -Wall -pedantic --std=c11
gcc -c loop.c -O3 -Wall -pedantic --std=c11
gcc -c loop2.c -Wall -pedantic --std=c11
gcc -c loop3.c -O3 -Wall -pedantic --std=c11
gcc -Wall -pedantic --std=c11 -c -o main.o main.c
gcc -o main add.o loop.o loop2.o loop3.o main.o -Wall -pedantic --std=c11
Disassembly
$ gdb main
(gdb) disass add
0x000000000040059d <+0>: push %rbp
0x000000000040059e <+1>: mov %rsp,%rbp
0x00000000004005a1 <+4>: mov %edi,-0x4(%rbp)
0x00000000004005a4 <+7>: mov %esi,-0x8(%rbp)
0x00000000004005a7 <+10>:mov -0x8(%rbp),%eax
0x00000000004005aa <+13>:mov -0x4(%rbp),%edx
0x00000000004005ad <+16>:add %edx,%eax
0x00000000004005af <+18>:pop %rbp
0x00000000004005b0 <+19>:retq
(gdb) disass loop
0x00000000004005c0 <+0>: push %rbx
0x00000000004005c1 <+1>: mov $0x3,%ebx
0x00000000004005c6 <+6>: nopw %cs:0x0(%rax,%rax,1)
0x00000000004005d0 <+16>:mov %ebx,%edx
0x00000000004005d2 <+18>:xor %eax,%eax
0x00000000004005d4 <+20>:mov $0x40079d,%esi
0x00000000004005d9 <+25>:mov $0x1,%edi
0x00000000004005de <+30>:add $0x2,%ebx
0x00000000004005e1 <+33>:callq 0x4004a0 <__printf_chk@plt>
0x00000000004005e6 <+38>:cmp $0x30d43,%ebx
0x00000000004005ec <+44>:jne 0x4005d0 <loop+16>
0x00000000004005ee <+46>:xor %eax,%eax
0x00000000004005f0 <+48>:pop %rbx
0x00000000004005f1 <+49>:retq
(gdb) disass loop2
0x00000000004005f2 <+0>: push %rbp
0x00000000004005f3 <+1>: mov %rsp,%rbp
0x00000000004005f6 <+4>: push %rbx
0x00000000004005f7 <+5>: sub $0x8,%rsp
0x00000000004005fb <+9>: mov $0x0,%ebx
0x0000000000400600 <+14>:jmp 0x400625 <loop2+51>
0x0000000000400602 <+16>:lea 0x2(%rbx),%edx
0x0000000000400605 <+19>:lea 0x1(%rbx),%eax
0x0000000000400608 <+22>:mov %edx,%esi
0x000000000040060a <+24>:mov %eax,%edi
0x000000000040060c <+26>:callq 0x40059d <add>
0x0000000000400611 <+31>:mov %eax,%esi
0x0000000000400613 <+33>:mov $0x400794,%edi
0x0000000000400618 <+38>:mov $0x0,%eax
0x000000000040061d <+43>:callq 0x400470 <printf@plt>
0x0000000000400622 <+48>:add $0x1,%ebx
0x0000000000400625 <+51>:cmp $0x1869f,%ebx
0x000000000040062b <+57>:jle 0x400602 <loop2+16>
0x000000000040062d <+59>:mov $0x0,%eax
0x0000000000400632 <+64>:add $0x8,%rsp
0x0000000000400636 <+68>:pop %rbx
0x0000000000400637 <+69>:pop %rbp
0x0000000000400638 <+70>:retq
(gdb) disass loop3
0x0000000000400640 <+0>: push %rbx
0x0000000000400641 <+1>: mov $0x3,%esi
0x0000000000400646 <+6>: mov $0x2,%edi
0x000000000040064b <+11>:mov $0x3,%ebx
0x0000000000400650 <+16>:callq 0x40059d <add>
0x0000000000400655 <+21>:mov $0x400798,%esi
0x000000000040065a <+26>:mov %eax,%edx
0x000000000040065c <+28>:mov $0x1,%edi
0x0000000000400661 <+33>:xor %eax,%eax
0x0000000000400663 <+35>:callq 0x4004a0 <__printf_chk@plt>
0x0000000000400668 <+40>:mov $0x5,%esi
0x000000000040066d <+45>:mov $0x4,%edi
0x0000000000400672 <+50>:callq 0x40059d <add>
0x0000000000400677 <+55>:mov $0x400798,%esi
0x000000000040067c <+60>:mov %eax,%edx
0x000000000040067e <+62>:mov $0x1,%edi
0x0000000000400683 <+67>:xor %eax,%eax
0x0000000000400685 <+69>:callq 0x4004a0 <__printf_chk@plt>
0x000000000040068a <+74>:nopw 0x0(%rax,%rax,1)
0x0000000000400690 <+80>:mov %ebx,%edx
0x0000000000400692 <+82>:xor %eax,%eax
0x0000000000400694 <+84>:mov $0x40079d,%esi
0x0000000000400699 <+89>:mov $0x1,%edi
0x000000000040069e <+94>:add $0x2,%ebx
0x00000000004006a1 <+97>:callq 0x4004a0 <__printf_chk@plt>
0x00000000004006a6 <+102>:cmp $0x30d43,%ebx
0x00000000004006ac <+108>:jne 0x400690 <loop3+80>
0x00000000004006ae <+110>:xor %eax,%eax
0x00000000004006b0 <+112>:pop %rbx
0x00000000004006b1 <+113>:retq
Symbol table
$ objdump -t main | grep add
0000000000000000 l df *ABS* 0000000000000000 add.c
000000000040059d g F .text 0000000000000014 add
$ objdump -t main | grep loop
0000000000000000 l df *ABS* 0000000000000000 loop.c
0000000000000000 l df *ABS* 0000000000000000 loop2.c
0000000000000000 l df *ABS* 0000000000000000 loop3.c
00000000004005c0 g F .text 0000000000000032 loop
00000000004005f2 g F .text 0000000000000047 loop2
0000000000400640 g F .text 0000000000000072 loop3
$ objdump -t main | grep main
main: file format elf64-x86-64
0000000000000000 l df *ABS* 0000000000000000 main.c
0000000000000000 F *UND* 0000000000000000 __libc_start_main@@GLIBC_2.2.5
00000000004006b2 g F .text 000000000000005a main
$ objdump -t main | grep inline
$
Well, that's it. After 3 hours of banging my head against the keyboard trying to figure it out, this was the best I could come up with. Feel free to point out any errors; I'd really appreciate it. I got really interested in this particular problem of inlining a single function call.
If you do not mind having two names for the same function, you could create a small wrapper around your function to "block" the always_inline attribute from affecting every call. In my example, loop_inlined would be the name you would use in performance-critical sections, while the plain loop would be used everywhere else.
inline.h
#include <stdlib.h>
static inline int loop_inlined() __attribute__((always_inline));
int loop();
static inline int loop_inlined() {
int n = 0, i;
for(i = 0; i < 10000; i++)
n += rand();
return n;
}
inline.c
#include "inline.h"
int loop() {
return loop_inlined();
}
main.c
#include "inline.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("%d\n", loop_inlined());
printf("%d\n", loop());
return 0;
}
This works regardless of the optimization level. Compiling with gcc inline.c main.c on Intel gives:
4011e6: c7 44 24 18 00 00 00 movl $0x0,0x18(%esp)
4011ed: 00
4011ee: eb 0e jmp 4011fe <_main+0x2e>
4011f0: e8 5b 00 00 00 call 401250 <_rand>
4011f5: 01 44 24 1c add %eax,0x1c(%esp)
4011f9: 83 44 24 18 01 addl $0x1,0x18(%esp)
4011fe: 81 7c 24 18 0f 27 00 cmpl $0x270f,0x18(%esp)
401205: 00
401206: 7e e8 jle 4011f0 <_main+0x20>
401208: 8b 44 24 1c mov 0x1c(%esp),%eax
40120c: 89 44 24 04 mov %eax,0x4(%esp)
401210: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
401217: e8 2c 00 00 00 call 401248 <_printf>
40121c: e8 7f ff ff ff call 4011a0 <_loop>
401221: 89 44 24 04 mov %eax,0x4(%esp)
401225: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
40122c: e8 17 00 00 00 call 401248 <_printf>
The first 7 instructions are the inlined call, and the regular call happens 5 instructions later.
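If both functions have to live in the same file, the wrapper itself might get inlined at higher optimization levels. A minimal sketch of how that could be avoided, assuming GCC or Clang (both attributes are compiler extensions) and using the hypothetical names sum_inlined and sum with a deterministic body instead of rand():

```c
/* Sketch only: sum_inlined and sum are hypothetical names, and both
   attributes are GCC/Clang extensions rather than standard C. */
static inline int sum_inlined(int n) __attribute__((always_inline));
int sum(int n) __attribute__((noinline));

/* Expanded into every caller because of always_inline. */
static inline int sum_inlined(int n) {
    int total = 0, i;
    for (i = 0; i < n; i++)
        total += i;
    return total;
}

/* Out-of-line entry point: noinline asks the compiler to keep calls to
   sum() as real calls, even though its body is the inlined loop. */
int sum(int n) {
    return sum_inlined(n);
}
```

Call sum_inlined() in the performance-critical section and sum() everywhere else; both return the same value.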
Here's a suggestion: write the body of the function in a separate header file.
Include that header in the place where the code has to be inline, and also inside the body of a regular function in a C file for all other calls.
void demo(void)
{
#include "myBody.h"
}
void importantloop(void)
{
// code
#include "myBody.h"
// code
}
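Since a single listing cannot pull in a second file, here is a rough single-file analogue of the same idea that swaps the included header for a macro. MY_BODY, demo and important_loop are hypothetical names, and the body (accumulating x * x) is only a stand-in for whatever myBody.h would contain:

```c
/* Hypothetical body, standing in for the contents of myBody.h.
   Like the included file, it relies on names from the enclosing
   scope: an int `acc` and an int `x`. */
#define MY_BODY() do { acc += x * x; } while (0)

/* Regular function for ordinary calls. */
int demo(int x)
{
    int acc = 0;
    MY_BODY();
    return acc;
}

/* Performance-critical code: the body is pasted in textually,
   so no function call happens inside the loop. */
int important_loop(int n)
{
    int acc = 0;
    for (int x = 0; x < n; x++)
        MY_BODY();
    return acc;
}
```

The trade-off is the same as with the header: the body is duplicated textually and silently depends on the variable names at the place where it is expanded.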
I assume that your function is a small one, since you want to inline it; if so, why not write it in asm?
As for inlining only a specific call to a function, I don't think anything exists that does this for you. Once a function is declared inline, and if the compiler decides to inline it, it will do so everywhere it sees a call to that function.