On Windows data can be loaded from DLLs, but it requires indirection through a pointer in the import address table. As a result, the compiler must know if an object that is being accessed is being imported from a DLL by using the __declspec(dllimport) type specifier.
This is unfortunate because it means that a header for a Windows library designed to be used as either a static library or a dynamic library needs to know which version of the library the program is linking to. This requirement is not applicable to functions, which are transparently emulated for DLLs with a stub function calling the real function, whose address is stored in the import address table.
On Linux the dynamic linker (ld.so) copies the values of all linked data objects from a shared object into a private mapped region for each process. This doesn't require indirection because the address of the private mapped region is local to the module, so its address is decided when the program is linked (and in the case of position independent executables, relative addressing is used).
Why doesn't Windows do the same? Is there a situation where a DLL might be loaded more than once, and thus require multiple copies of linked data? Even if that were the case, it wouldn't apply to read-only data.
It seems that the MSVCRT handles this issue by defining the _DLL macro when targeting the dynamic C runtime library (with the /MD or /MDd flag), then using that in all standard headers to conditionally declare all exported symbols with __declspec(dllimport). I suppose you could reuse this macro if you only supported statically linking when using the static C runtime and dynamically linking when using the dynamic C runtime.
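For comparison, a common alternative to reusing _DLL is a per-library macro that the consumer defines. The following is only a minimal sketch: the names MYLIB_SHARED, MYLIB_BUILD, MYLIB_API, mylib_counter, mylib_increment and client_bump are all hypothetical, and the three "files" are collapsed into one listing for brevity.

```c
#include <assert.h>

/* mylib.h (hypothetical): one header for both static and DLL builds. */
#if defined(_WIN32) && defined(MYLIB_SHARED)
#  ifdef MYLIB_BUILD
#    define MYLIB_API __declspec(dllexport) /* compiling the DLL itself */
#  else
#    define MYLIB_API __declspec(dllimport) /* compiling a client of the DLL */
#  endif
#else
#  define MYLIB_API /* static library, or a non-Windows target */
#endif

extern MYLIB_API int mylib_counter;  /* data: the attribute matters here */
MYLIB_API int mylib_increment(void); /* function: the attribute is optional */

/* mylib.c (hypothetical): built with MYLIB_BUILD defined. */
int mylib_counter = 0;

int mylib_increment(void) {
    return ++mylib_counter;
}

/* client.c (hypothetical): reads the exported object directly. */
int client_bump(void) {
    int before = mylib_counter;
    int after = mylib_increment();
    assert(after == before + 1);
    return after;
}
```

The client still has to define MYLIB_SHARED consistently with the library it links to, so this only moves the problem into the build system rather than removing it.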
References:
LNK4217 - Russ Keldorph's WebLog (emphasis mine)
__declspec(dllimport) can be used on both code and data, and its semantics are subtly different between the two. When applied to a routine call, it is purely a performance optimization. For data, it is required for correctness.
[...]
Importing data
If you export a data item from a DLL, you must declare it with __declspec(dllimport) in the code that accesses it. In this case, instead of generating a direct load from memory, the compiler generates a load through a pointer, resulting in one additional indirection. Unlike calls, where the linker will fix up the code correctly whether the routine was declared __declspec(dllimport) or not, accessing imported data requires __declspec(dllimport). If omitted, the code will wind up accessing the IAT entry instead of the data in the DLL, probably resulting in unexpected behavior.
Importing into an Application Using __declspec(dllimport)
Using __declspec(dllimport) is optional on function declarations, but the compiler produces more efficient code if you use this keyword. However, you must use __declspec(dllimport) for the importing executable to access the DLL's public data symbols and objects.
Importing Data Using __declspec(dllimport)
When you mark the data as __declspec(dllimport), the compiler automatically generates the indirection code for you.
Importing Using DEF Files (interesting historical notes about accessing the IAT directly)
How do I share data in my DLL with an application or with other DLLs?
By default, each process using a DLL has its own instance of all the DLLs global and static variables.
Linker Tools Warning LNK4217
What happens when you get dllimport wrong? (seems to be unaware of data semantics)
How do I export data from a DLL?
CRT Library Features (documents the _DLL macro)
Linux and Windows use different strategies for accessing data stored in dynamic libraries.
On Linux, an undefined reference to an object is resolved to a library at link time. The linker finds the size of the object and reserves space for it in the .bss or the .rdata segment of the executable. When executed, the dynamic linker (ld.so) resolves the symbol to a dynamic library (again), and copies the object from the dynamic library to the process's memory.
On Windows, an undefined reference to an object is resolved to an import library at link time, and no space is reserved for it. When the module is executed, the dynamic linker resolves the symbol to a dynamic library, and creates a copy on write memory map in the process, backed by a shared data segment in the dynamic library.
The advantage of a copy on write memory map is that if the linked data is unchanged, then it can be shared with other processes. In practice this is a trifling benefit which greatly increases complexity, both for the toolchain and programs using dynamic libraries. For objects which are actually written this is always less efficient.
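To make the copy-on-write idea concrete, here is a minimal POSIX sketch (not the Windows loader itself): a file is mapped with MAP_PRIVATE, written through the mapping, and then re-read to show that the backing file is unchanged. The function name and the /tmp scratch path are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through a private mapping left the file untouched,
 * 0 if the file changed, and -1 on any setup error. */
int private_write_stays_private(void) {
    char path[] = "/tmp/cow_demo_XXXXXX"; /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path); /* the file lives on until fd is closed */

    const char original = 'A';
    if (write(fd, &original, 1) != 1) { close(fd); return -1; }

    /* MAP_PRIVATE: the page is shared with the file until first written. */
    char *map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }
    assert(map[0] == original);

    map[0] = 'B'; /* this write triggers the private copy of the page */

    char on_disk = 0;
    ssize_t n = pread(fd, &on_disk, 1, 0);
    munmap(map, 1);
    close(fd);
    if (n != 1) return -1;
    return on_disk == original;
}
```

The first write to each page pays for a page fault and a copy, which is the cost the "dirty" loop in the test below is meant to expose.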
I suspect, although I have no evidence, that this decision was made for a particular and now outdated use case. Perhaps it was common practice to use large (for the time) read only objects in dynamic libraries on 16-bit Windows (in official Microsoft programs or otherwise). Either way, I doubt anyone at Microsoft has the expertise and time to change it now.
In order to investigate the issue I created a program which writes to an object from a dynamic library. It writes one byte per page (4096 bytes) in the object, then writes the entire object, then retries the initial one byte per page write. If the object is reserved for the process before main is called, the first and third loops should take approximately the same time, and the second loop should take longer than both. If the object is a copy on write map to a dynamic library, the first loop should take at least as long as the second, and the third should take less time than both.
The results are consistent with my hypothesis, and analyzing the disassembly confirms that Linux accesses the dynamic library data at a link time address, relative to the program counter. Surprisingly, Windows not only accesses the data indirectly, but also reloads the pointers to the data and to its length from the import address table on every loop iteration, even with optimizations enabled. This was tested with Visual Studio 2010 on Windows XP, so maybe things have changed, although I doubt they have.
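If those repeated loads matter, one possible workaround is to pass the address and length into a function so that they live in locals. This is only a sketch: touch_pages and PAGE_BYTES are hypothetical names, and whether Visual Studio 2010 then keeps the values in registers was not tested.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_BYTES 4096u /* assumed page size, as in the test program */

/* Writes one zero byte per page and returns the number of pages touched.
 * The pointer and length are parameters, so the import address table is
 * consulted once at the call site instead of inside the loop. */
unsigned int touch_pages(unsigned char volatile *data, unsigned int len) {
    assert(data != NULL || len == 0);
    unsigned int pages = 0;
    for (unsigned int i = 0; i < len; i += PAGE_BYTES) {
        data[i] = 0;
        pages++;
    }
    return pages;
}
```

In the test program this would be called as touch_pages(libdat_dat, libdat_dat_len).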
Here are the results for Linux:
$ dd bs=1M count=16 if=/dev/urandom of=libdat.dat
$ xxd -i libdat.dat libdat.c
$ gcc -O3 -g -shared -fPIC libdat.c -o libdat.so
$ gcc -O3 -g -no-pie -L. -ldat dat.c -o dat
$ LD_LIBRARY_PATH=. ./dat
local = 0x1601060
libdat_dat = 0x601040
libdat_dat_len = 0x601020
dirty= 461us write= 12184us retry= 456us
$ nm dat
[...]
0000000000601040 B libdat_dat
0000000000601020 B libdat_dat_len
0000000001601060 B local
[...]
$ objdump -d -j.text dat
[...]
400693: 8b 35 87 09 20 00 mov 0x200987(%rip),%esi # 601020 <libdat_dat_len>
[...]
4006a3: 31 c0 xor %eax,%eax # zero loop counter
4006a5: 48 8d 15 94 09 20 00 lea 0x200994(%rip),%rdx # 601040 <libdat_dat>
4006ac: 0f 1f 40 00 nopl 0x0(%rax) # align loop for efficiency
4006b0: 89 c1 mov %eax,%ecx # store data offset in ecx
4006b2: 05 00 10 00 00 add $0x1000,%eax # add PAGESIZE to data offset
4006b7: c6 04 0a 00 movb $0x0,(%rdx,%rcx,1) # write a zero byte to data
4006bb: 39 f0 cmp %esi,%eax # test loop condition
4006bd: 72 f1 jb 4006b0 <main+0x30> # continue loop if data is left
[...]
Here are the results for Windows:
$ cl /Ox /Zi /LD libdat.c /link /EXPORT:libdat_dat /EXPORT:libdat_dat_len
[...]
$ cl /Ox /Zi dat.c libdat.lib
[...]
$ dat.exe # note low resolution timer means retry is too small to measure
local = 0041EEA0
libdat_dat = 1000E000
libdat_dat_len = 1100E000
dirty= 20312us write= 3125us retry= 0us
$ dumpbin /symbols dat.exe
[...]
9000 .data
1000 .idata
5000 .rdata
1000 .reloc
17000 .text
[...]
$ dumpbin /disasm dat.exe
[...]
004010BA: 33 C0 xor eax,eax # zero loop counter
[...]
004010C0: 8B 15 8C 63 42 00 mov edx,dword ptr [__imp__libdat_dat] # store data pointer in edx
004010C6: C6 04 02 00 mov byte ptr [edx+eax],0 # write a zero byte to data
004010CA: 8B 0D 88 63 42 00 mov ecx,dword ptr [__imp__libdat_dat_len] # store data length in ecx
004010D0: 05 00 10 00 00 add eax,1000h # add PAGESIZE to data offset
004010D5: 3B 01 cmp eax,dword ptr [ecx] # test loop condition
004010D7: 72 E7 jb 004010C0 # continue loop if data is left
[...]
Here is the source code used for both tests:
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
typedef FILETIME time_l;
time_l time_get(void) {
FILETIME ret; GetSystemTimeAsFileTime(&ret); return ret;
}
long long int time_diff(time_l const *c1, time_l const *c2) {
return 1LL*c2->dwLowDateTime/100-c1->dwLowDateTime/100+c2->dwHighDateTime*100000-c1->dwHighDateTime*100000;
}
#else
#include <unistd.h>
#include <time.h>
#include <stdlib.h>
typedef struct timespec time_l;
time_l time_get(void) {
time_l ret; clock_gettime(CLOCK_MONOTONIC, &ret); return ret;
}
long long int time_diff(time_l const *c1, time_l const *c2) {
return 1LL*c2->tv_nsec/1000-c1->tv_nsec/1000+c2->tv_sec*1000000-c1->tv_sec*1000000;
}
#endif
#ifndef PAGESIZE
#define PAGESIZE 4096
#endif
#ifdef _WIN32
#define DLLIMPORT __declspec(dllimport)
#else
#define DLLIMPORT
#endif
extern DLLIMPORT unsigned char volatile libdat_dat[];
extern DLLIMPORT unsigned int libdat_dat_len;
unsigned int local[4096];
int main(void) {
unsigned int i;
time_l t1, t2, t3, t4;
long long int d1, d2, d3;
t1 = time_get();
for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
libdat_dat[i] = 0;
}
t2 = time_get();
for(i=0; i < libdat_dat_len; i++) {
libdat_dat[i] = 0xFF;
}
t3 = time_get();
for(i=0; i < libdat_dat_len; i+=PAGESIZE) {
libdat_dat[i] = 0;
}
t4 = time_get();
d1 = time_diff(&t1, &t2);
d2 = time_diff(&t2, &t3);
d3 = time_diff(&t3, &t4);
printf("%-15s= %18p\n%-15s= %18p\n%-15s= %18p\n", "local", local, "libdat_dat", libdat_dat, "libdat_dat_len", &libdat_dat_len);
printf("dirty=%9lldus write=%9lldus retry=%9lldus\n", d1, d2, d3);
return 0;
}
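A note on the Windows timer code: FILETIME counts 100-nanosecond intervals, so the Windows time_diff above, which divides the low word by 100 and scales the high word by 100000, does not yield microseconds, and the Windows figures may be off by a constant factor. A portable sketch of the conversion follows. It takes the dwLowDateTime and dwHighDateTime halves as plain integers so that it compiles without windows.h, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Combines the two 32-bit halves of a FILETIME-style timestamp. */
static uint64_t ticks_100ns(uint32_t low, uint32_t high) {
    return ((uint64_t)high << 32) | low;
}

/* Microseconds from (low1, high1) to (low2, high2); 10 ticks = 1 us. */
long long filetime_diff_us(uint32_t low1, uint32_t high1,
                           uint32_t low2, uint32_t high2) {
    uint64_t start = ticks_100ns(low1, high1);
    uint64_t end = ticks_100ns(low2, high2);
    assert(end >= start);
    return (long long)((end - start) / 10);
}
```

Combining the halves before subtracting also handles a carry from the low word into the high word between the two samples.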
I sincerely hope someone else benefits from my research. Thanks for reading!
I have this small testcode atfork_demo.c:
#include <stdio.h>
#include <pthread.h>
void hello_from_fork_prepare() {
printf("Hello from atfork prepare.\n");
fflush(stdout);
}
void register_hello_from_fork_prepare() {
pthread_atfork(&hello_from_fork_prepare, 0, 0);
}
Now, I compile it in two different ways:
gcc -shared -fPIC atfork_demo.c -o atfork_demo1.so
gcc -shared -fPIC atfork_demo.c -o atfork_demo2.so -lpthread
My demo main atfork_demo_main.c is this:
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>
int main(int argc, const char** argv) {
if(argc <= 1) {
printf("usage: ... lib.so\n");
return 1;
}
void* plib = dlopen("libpthread.so.0", RTLD_NOW|RTLD_GLOBAL);
if(!plib) {
printf("cannot load pthread, error %s\n", dlerror());
return 1;
}
void* lib = dlopen(argv[1], RTLD_LAZY);
if(!lib) {
printf("cannot load %s, error %s\n", argv[1], dlerror());
return 1;
}
void (*reg)();
reg = dlsym(lib, "register_hello_from_fork_prepare");
if(!reg) {
printf("did not find func, error %s\n", dlerror());
return 1;
}
reg();
fork();
}
Which I compile like this:
gcc atfork_demo_main.c -o atfork_demo_main.exec -ldl
Now, I have another small demo atfork_patch.c where I want to override pthread_atfork:
#include <stdio.h>
int pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void)) {
printf("Ignoring pthread_atfork call!\n");
fflush(stdout);
return 0;
}
Which I compile like this:
gcc -shared -O2 -fPIC atfork_patch.c -o atfork_patch.so
And then I set LD_PRELOAD=./atfork_patch.so, and do these two calls:
./atfork_demo_main.exec ./atfork_demo1.so
./atfork_demo_main.exec ./atfork_demo2.so
In the first case, the LD_PRELOAD-override of pthread_atfork worked and in the second, it did not. I get the output:
Ignoring pthread_atfork call!
Hello from atfork prepare.
So, now to the question(s):
Why did it not work in the second case?
How can I make it work also in the second case, i.e. also override it?
In my real use case, atfork_demo is some library which I cannot change. I also cannot change atfork_demo_main but I can make it load any other code. I would prefer if I can just do it with some change in atfork_patch.
You get some more debug output if you also use LD_DEBUG=all. Maybe interesting is this bit, for the second case:
841: symbol=__register_atfork; lookup in file=./atfork_demo_main.exec [0]
841: symbol=__register_atfork; lookup in file=./atfork_patch_extended.so [0]
841: symbol=__register_atfork; lookup in file=/lib/x86_64-linux-gnu/libdl.so.2 [0]
841: symbol=__register_atfork; lookup in file=/lib/x86_64-linux-gnu/libc.so.6 [0]
841: binding file ./atfork_demo2.so [0] to /lib/x86_64-linux-gnu/libc.so.6 [0]: normal symbol `__register_atfork' [GLIBC_2.3.2]
So, it searches for the symbol __register_atfork. I added that to atfork_patch_extended.so but it doesn't find it and uses it from libc instead. How can I make it find and use my __register_atfork?
As a side note, my main goal is to ignore the atfork handlers when fork() is called, but this is not the question here, but actually here. One solution to that, which seems to work, is to override fork() itself by this:
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>
pid_t fork(void) {
return syscall(SYS_clone, SIGCHLD, 0);
}
Before answering this question, I would stress that this is a really bad idea for any production application.
If you are using a third party library that puts such constraints in place, then think about an alternative solution, such as forking early to maintain a "helper" process, with a pipe between you and it... then, when you need to call exec(), you can request that it does the work (fork(), exec()) on your behalf.
Patching or otherwise side-stepping a library facility such as pthread_atfork() is just asking for trouble (missed events, memory leaks, crashes, etc...).
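The helper-process idea mentioned above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a hardened implementation: the struct and function names are hypothetical, commands are passed to /bin/sh, and each command must be shorter than 256 bytes.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct helper { pid_t pid; int to_helper; int from_helper; };

/* Helper loop: read a length-prefixed shell command, run it, report status. */
static void helper_loop(int in, int out) {
    size_t len;
    char cmd[256];
    while (read(in, &len, sizeof len) == (ssize_t)sizeof len && len < sizeof cmd) {
        if (read(in, cmd, len) != (ssize_t)len) break;
        cmd[len] = '\0';
        int status = -1;
        pid_t child = fork(); /* the helper predates any later atfork handlers */
        if (child == 0) {
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);
        }
        if (child > 0 && waitpid(child, &status, 0) == child && WIFEXITED(status))
            status = WEXITSTATUS(status);
        else
            status = -1;
        if (write(out, &status, sizeof status) != (ssize_t)sizeof status) break;
    }
    _exit(0);
}

/* Start the helper early, before the third-party library is loaded. */
int helper_start(struct helper *h) {
    int req[2], rep[2];
    if (pipe(req) != 0) return -1;
    if (pipe(rep) != 0) { close(req[0]); close(req[1]); return -1; }
    h->pid = fork();
    if (h->pid < 0) return -1;
    if (h->pid == 0) {
        close(req[1]);
        close(rep[0]);
        helper_loop(req[0], rep[1]); /* never returns */
    }
    close(req[0]);
    close(rep[1]);
    h->to_helper = req[1];
    h->from_helper = rep[0];
    return 0;
}

/* Ask the helper to run a shell command; returns its exit status or -1. */
int helper_run(const struct helper *h, const char *cmd) {
    size_t len = strlen(cmd);
    int status = -1;
    assert(len < 256);
    if (write(h->to_helper, &len, sizeof len) != (ssize_t)sizeof len) return -1;
    if (write(h->to_helper, cmd, len) != (ssize_t)len) return -1;
    if (read(h->from_helper, &status, sizeof status) != (ssize_t)sizeof status)
        return -1;
    return status;
}

/* Close the pipes so the helper exits, then reap it. */
void helper_stop(struct helper *h) {
    close(h->to_helper);
    close(h->from_helper);
    waitpid(h->pid, NULL, 0);
}
```

The main program would call helper_start() first thing in main(), and later use helper_run() wherever it would otherwise have called fork() and exec() itself.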
As @Sergio pointed out, pthread_atfork() is actually built into atfork_demo2.so, so you can't do anything to override it... However examining the disassembly / source of pthread_atfork() gives you a decent hint about how to achieve what you're asking:
0000000000000830 <__pthread_atfork>:
830: 48 8d 05 f9 07 20 00 lea 0x2007f9(%rip),%rax # 201030 <__dso_handle>
837: 48 85 c0 test %rax,%rax
83a: 74 0c je 848 <__pthread_atfork+0x18>
83c: 48 8b 08 mov (%rax),%rcx
83f: e9 6c fe ff ff jmpq 6b0 <__register_atfork@plt>
844: 0f 1f 40 00 nopl 0x0(%rax)
848: 31 c9 xor %ecx,%ecx
84a: e9 61 fe ff ff jmpq 6b0 <__register_atfork@plt>
or the source (from here):
int
pthread_atfork (void (*prepare) (void),
void (*parent) (void),
void (*child) (void))
{
return __register_atfork (prepare, parent, child, &__dso_handle == NULL ? NULL : __dso_handle);
}
As you can see, pthread_atfork() does nothing aside from calling __register_atfork()... so patch that instead!
The content of atfork_patch.c now becomes: (using __register_atfork()'s prototype, from here / here)
#include <stdio.h>
int __register_atfork (void (*prepare) (void), void (*parent) (void),
void (*child) (void), void *dso_handle) {
printf("Ignoring pthread_atfork call!\n");
fflush(stdout);
return 0;
}
This works for both demos:
$ LD_PRELOAD=./atfork_patch.so ./atfork_demo_main.exec ./atfork_demo1.so
Ignoring pthread_atfork call!
$ LD_PRELOAD=./atfork_patch.so ./atfork_demo_main.exec ./atfork_demo2.so
Ignoring pthread_atfork call!
It doesn't work for the second case because there is nothing to override. Your second library is linked statically with pthread library:
$ readelf --symbols atfork_demo1.so | grep pthread_atfork
7: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND pthread_atfork
54: 0000000000000000 0 NOTYPE GLOBAL DEFAULT UND pthread_atfork
$ readelf --symbols atfork_demo2.so | grep pthread_atfork
41: 0000000000000000 0 FILE LOCAL DEFAULT ABS pthread_atfork.c
47: 0000000000000830 31 FUNC LOCAL DEFAULT 12 __pthread_atfork
49: 0000000000000830 31 FUNC LOCAL DEFAULT 12 pthread_atfork
So it will use local pthread_atfork each time, regardless of LD_PRELOAD or any other loaded libraries.
How to overcome that? For the described configuration it does not look possible, since you would need to modify the atfork_demo library or the main executable anyway.
I wonder if it's possible for a linux process to call code located in the memory of another process?
Let's say we have a function f() in process A and we want process B to call it. What I thought about is using mmap with MAP_SHARED and PROT_EXEC flags to map the memory containing the function code and pass the pointer to B, assuming, that f() will not call any other function from A binary. Will it ever work? If yes, then how do I determine the size of f() in memory?
=== EDIT ===
I know, that shared libraries will do exactly that, but I wonder if it's possible to dynamically share code between processes.
Yes, you can do that, but the first process must have first created the shared memory via mmap and either a memory-mapped file, or a shared area created with shm_open.
If you are sharing compiled code then that's what shared libraries were created for. You can link against them in the ordinary way and the sharing will happen automatically, or you can load them manually using dlopen (e.g. for a plugin).
Update:
As the code has been generated by a compiler then you will have relocations to worry about. The compiler does not produce code that will Just Work anywhere. It expects that the .data section is in a certain place, and that the .bss section has been zeroed. The GOT will need to be populated. Any static constructors will have to be called.
In short, what you want is probably dlopen. This system allows you to open a shared library like it was a file, and then extract function pointers by name. Each program that dlopens the library will share the code sections, thus saving memory, but each will have its own copy of the data section, so they do not interfere with each other.
Beware that you need to compile your library code with -fPIC or else you won't get any code sharing either (actually, the linkers and dynamic loaders for many architectures probably don't support libraries that aren't PIC anyway).
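As a rough illustration of the dlopen/dlsym route described above, here is a minimal sketch. The helper name load_symbol is hypothetical; on older glibc versions the program also has to be linked with -ldl.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Looks up `symbol` in `library`. A NULL library means the running program
 * and the libraries it already has loaded. Returns NULL on failure. */
void *load_symbol(const char *library, const char *symbol) {
    assert(symbol != NULL);
    void *handle = dlopen(library, RTLD_NOW);
    if (handle == NULL)
        return NULL; /* dlerror() would describe the problem */
    /* The handle is deliberately left open: closing it could unmap the code. */
    return dlsym(handle, symbol);
}
```

For a plugin this might be used as int (*f)(int) = (int (*)(int))load_symbol("./libfoo.so", "f"); where libfoo.so and f are placeholders for your own library and function.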
The standard approach is to put the code of f() in a shared library libfoo.so. Then you could either link to that library (e.g. by building program A with gcc -Wall a.c -lfoo -o a.bin), or load it dynamically (e.g. in program B) using dlopen(3) then retrieving the address of f using dlsym.
When you compile a shared library you want to :
compile each source file foo1.c with gcc -Wall -fPIC -c foo1.c -o foo1.pic.o into position independent code, and likewise for foo2.c into foo2.pic.o
link all of them into libfoo.so with gcc -Wall -shared foo*.pic.o -o libfoo.so ; notice that you can link additional shared libraries into libfoo.so (e.g. by appending -lm to the linking command)
See also the Program Library Howto.
You could play insane tricks by mmap-ing some other /proc/1234/mem but that is not reasonable at all. Use shared libraries.
PS. you can dlopen a big lot (hundreds of thousands) of shared objects lib*.so files; you may want to dlclose them (but practically you don't have to).
It would be possible to do so, but that's exactly what shared libraries are for.
Also, beware that you need to check that the address of the shared memory is the same for both processes, otherwise any references that are "absolute" (that is, a pointer to something in the shared code) will be wrong. And like with shared libraries, the bitness of the code will have to be the same, and as with all shared memory, you need to make sure that you don't "mess up" for the other process if you modify any of the shared memory.
Determining the size of a function ranges from "hard" to "nearly impossible", depending on the actual code generated, and the level of information you have available. Debug symbols will have the size of a function, but beware that I have seen compilers generate code where two functions share the same "return" piece of code (that is, the compiler generates a jump to another function that has the same bit of code to return the result, because it saves a few bytes of code, and there was already going to be a jump anyway [e.g. there is a if/else that the compiler has to jump around]).
not directly
that's what shared libraries are for
relocations
Oh no! Anyways...
Here's the insane, unreasonable, not-good, purely academic demonstration of this capability. It was fun for me, I hope it's fun for you.
Overview
Program A will use shm_open to create a shared memory object, and mmap to map it to its memory space. Then it will copy some code from a function defined in A to the shared memory. Then program B will open up the shared memory, execute the function, and just for kicks, make a very simple modification to the code. Then A will execute the code to demonstrate the change took effect.
Again, this is no recommendation for how to solve a problem, it's an academic demonstration.
// A.c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
int foo(int y) {
int x = 14;
return x + y;
}
int main(int argc, char *argv[]) {
const size_t mem_size = 0x1000;
// create shared memory objects
int shared_fd = shm_open("foobar2", O_RDWR | O_CREAT, 0777);
ftruncate(shared_fd, mem_size);
void *shared_mem =
mmap(NULL, mem_size, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_SHARED, shared_fd, 0);
// copy function to shared memory
const size_t fn_size = 24;
memcpy(shared_mem, &foo, fn_size);
// wait
getc(stdin);
// execute the shared function
int(*shared_foo)(int) = shared_mem;
printf("shared_foo(3) = %d\n", shared_foo(3));
// clean up
shm_unlink("foobar2");
}
Note the use of PROT_READ | PROT_WRITE | PROT_EXEC in the call to mmap. This program is compiled with
gcc A.c -lrt -o A
The constant fn_size was determined by looking at the output of objdump -dj .text A
...
000000000000088a <foo>:
88a: 55 push %rbp
88b: 48 89 e5 mov %rsp,%rbp
88e: 89 7d ec mov %edi,-0x14(%rbp)
891: c7 45 fc 0e 00 00 00 movl $0xe,-0x4(%rbp)
898: 8b 55 fc mov -0x4(%rbp),%edx
89b: 8b 45 ec mov -0x14(%rbp),%eax
89e: 01 d0 add %edx,%eax
8a0: 5d pop %rbp
8a1: c3 retq
...
I think that's 24 bytes, I dunno. I guess I could put anything larger than that and it would do the same thing. Anything shorter and I'll probably get an exception from the processor. Also, note that the value of x from foo (14, that's (apparently) 0e 00 00 00 in LE) is located at foo + 10. This will be the constant x_offset in program B.
// B.c
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
const int x_offset = 10;
int main(int argc, char *argv[]) {
// create shared memory objects
int shared_fd = shm_open("foobar2", O_RDWR | O_CREAT, 0777);
void *shared_mem = mmap(NULL, 0x1000, PROT_EXEC | PROT_WRITE, MAP_SHARED, shared_fd, 0);
int (*shared_foo)(int) = shared_mem;
int z = shared_foo(13);
printf("result: %d\n", z);
int *x_p = (int*)((char*)shared_mem + x_offset);
*x_p = 100;
shm_unlink("foobar2");
}
Anyways first I run A, then I run B. The output of B is:
result: 27
Then I go back to A and push enter, then I get:
shared_foo(3) = 103
Good enough for me.
/dev/shm/foobar2
To completely eliminate the mystique of all this, after running A you can do something like
xxd /dev/shm/foobar2 | vim -
Then, edit that constant 0e 00 00 00 just like before, then save the file with the 'ol
:w !xxd -r > /dev/shm/foobar2
and push enter in A and see similar results as above.
Let's say that I have a function that gets called in multiple parts of a program. Let's also say that I have a particular call to that function that is in an extremely performance-sensitive section of code (e.g., a loop that iterates tens of millions of times and where each microsecond counts). Is there a way that I can force the compiler (gcc in my case) to inline that single, particular function call, without inlining the others?
EDIT: Let me make this completely clear: this question is NOT about forcing gcc (or any other compiler) to inline all calls to a function; rather, it is about requesting that the compiler inline a particular call to a function.
In C before C99 (as opposed to C++) there's no standard way to suggest that a function should be inlined; there are only vendor-specific extensions.
However you specify it, as far as I know the compiler will always try to inline every instance, so use that function only once:
original:
int MyFunc() { /* do stuff */ }
change to:
inline int MyFunc_inlined() { /* do stuff */ }
int MyFunc() { return MyFunc_inlined(); }
Now, in the places where you want it inlined, use MyFunc_inlined()
Note: "inline" keyword in the above is just a placeholder for whatever syntax gcc uses to force an inlining. If H2CO3's deleted answer is to be trusted, that would be:
static inline __attribute__((always_inline)) int MyFunc_inlined() { /* do stuff */ }
It is possible to enable inlining per translation unit (but not per call). Though this is not an answer for the question and is an ugly trick, it conforms to C standard and may be interesting as related stuff.
The trick is to use extern definition where you do not want to inline, and extern inline where you need inlining.
Example:
$ cat func.h
int func();
$ cat func.c
int func() { return 10; }
$ cat func_inline.h
extern inline int func() { return 5; }
$ cat main.c
#include <stdio.h>
#ifdef USE_INLINE
# include "func_inline.h"
#else
# include "func.h"
#endif
int main() { printf("%d\n", func()); return 0; }
$ gcc main.c func.c && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE && ./a.out
10 // non-inlined version
$ gcc main.c func.c -DUSE_INLINE -O2 && ./a.out
5 // inlined!
You can also use a non-standard attribute (e.g. __attribute__((always_inline)) in GCC) for the extern inline definition, instead of relying on -O2.
BTW, the trick is used in glibc.
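Note that the transcript above relies on GNU89 inline semantics. Since GCC 5 the default is gnu11, which uses C99 semantics, where the roles of inline and extern inline are reversed, so the example above would likely fail to link there unless -fgnu89-inline or -std=gnu89 is passed. A minimal sketch of the C99 form, with hypothetical function names and everything collapsed into one translation unit:

```c
#include <assert.h>

/* Normally in a header: a C99 inline definition, usable for inlining but
 * not emitted as an external symbol by itself. */
inline int answer(void) { return 5; }

/* In exactly one .c file: this declaration makes the compiler emit the
 * external definition that non-inlined calls link against. */
extern inline int answer(void);

int call_answer(void) {
    int value = answer();
    assert(value == 5);
    return value;
}
```

Unlike the GNU89 trick, the inline and external versions here share one body, so they cannot return different values.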
The traditional way to force inline a function in C was to not use a function at all, but to use a function-like macro. This method will always inline the function, but there are some problems with function-like macros. For example:
#define ADD(x, y) ((x) + (y))
printf("%d\n", ADD(2, 2));
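One such problem is that a macro argument is pasted in textually, so an argument with side effects can be evaluated more than once. A minimal sketch, using the hypothetical names SQUARE, next_value and square_fn:

```c
#include <assert.h>

#define SQUARE(x) ((x) * (x)) /* textual substitution: x appears twice */

static int calls = 0;

/* Hypothetical helper with a visible side effect: it counts its calls. */
int next_value(void) {
    calls++;
    return 3;
}

int square_fn(int x) { return x * x; }

/* Returns how many times next_value() ran for one SQUARE(...) use. */
int calls_made_by_macro(void) {
    calls = 0;
    int result = SQUARE(next_value());
    assert(result == 9);
    return calls;
}

/* Same question for the real function. */
int calls_made_by_function(void) {
    calls = 0;
    int result = square_fn(next_value());
    assert(result == 9);
    return calls;
}
```

The macro version calls next_value() twice, while the function version calls it once.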
There is also the inline keyword, which was added to C in the C99 standard. Notably, Microsoft's Visual C compiler doesn't support C99, and thus you can't use inline with that (miserable) compiler. Inline only hints to the compiler that you want the function inlined - it does not guarantee it.
GCC has an extension which requires the compiler to inline the function.
inline __attribute__((always_inline)) int add(int x, int y) {
return x + y;
}
To make this cleaner, you may want to use a macro:
#define ALWAYS_INLINE inline __attribute__((always_inline))
ALWAYS_INLINE int add(int x, int y) {
return x + y;
}
I don't know of a direct way of having a function that can be force inlined on certain calls. But you can combine the techniques like this:
#define ALWAYS_INLINE inline __attribute__((always_inline))
#define ADD(x, y) ((x) + (y))
ALWAYS_INLINE int always_inline_add(int x, int y) {
return ADD(x, y);
}
int normal_add(int x, int y) {
return ADD(x, y);
}
Or, you could just have this:
#define ADD(x, y) ((x) + (y))
int add(int x, int y) {
return ADD(x, y);
}
int main() {
printf("%d\n", ADD(2,2)); // always inline
printf("%d\n", add(2,2)); // normal function call
return 0;
}
Also, note that forcing the inline of a function might not make your code faster. Inline functions cause larger code to be generated, which might cause more cache misses to occur.
I hope that helps.
The answer is it depends on your function, what you request and the nature of your function. Your best bet is to:
tell the compiler you want it inlined
make the function static (be careful with extern as its semantics change a little in gcc in some modes)
set the compiler options to inform the optimizer you want inlining, and set inline limits appropriately
turn on any couldn't inline warnings on the compiler
verify the output (you could check the assembler generated) that the function is in-lined.
Compiler hints
The answers here cover just one side of inlining, the language hints to the compiler. When the standard says:
Making a function an inline function suggests that calls to the function be as
fast as possible. The extent to which such suggestions are effective is
implementation-defined
This can be the case for other stronger hints such as:
GNU's __attribute__((always_inline)): Generally, functions are not inlined unless optimization is specified. For functions declared inline, this attribute inlines the function even if no optimization level was specified.
Microsoft's __forceinline: The __forceinline keyword overrides the cost/benefit analysis and relies on the judgment of the programmer instead. Exercise caution when using __forceinline. Indiscriminate use of __forceinline can result in larger code with only marginal performance gains or, in some cases, even performance losses (due to increased paging of a larger executable, for example).
Even both of these would rely on the inlining being possible, and crucially on compiler flags. To work with inlined functions you also need to understand the optimisation settings of your compiler.
It may be worth saying that inlining can also be used to provide replacements for existing functions just for the compilation unit you are in. This can be used when approximate answers are good enough for your algorithm, or when a result can be achieved in a faster way with local data structures.
An inline definition
provides an alternative to an external definition, which a translator may use to implement
any call to the function in the same translation unit. It is unspecified whether a call to the
function uses the inline definition or the external definition.
Some functions cannot be inlined
For example, for the GNU compiler functions that cannot be inlined are:
Note that certain usages in a function definition can make it unsuitable for inline substitution. Among these usages are: variadic functions, use of alloca, use of variable-length data types (see Variable Length), use of computed goto (see Labels as Values), use of nonlocal goto, and nested functions (see Nested Functions). Using -Winline warns when a function marked inline could not be substituted, and gives the reason for the failure.
So even always_inline may not do what you expect.
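As an illustration of one case from that list, here is a minimal sketch of a variadic function marked inline. The names sum_ints and sum_three are hypothetical. The keyword is accepted, but GCC is expected to keep the calls out of line and, with -Winline, may say why.

```c
#include <assert.h>
#include <stdarg.h>

/* Variadic, so GCC's documentation lists it as unsuitable for inline
 * substitution even though the keyword is accepted. */
static inline int sum_ints(int count, ...) {
    assert(count >= 0);
    va_list args;
    va_start(args, count);
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(args, int);
    va_end(args);
    return total;
}

/* A fixed-arity caller: the call to sum_ints remains a real call. */
int sum_three(int a, int b, int c) {
    return sum_ints(3, a, b, c);
}
```

Checking the generated assembly, as suggested earlier, is the reliable way to see what actually happened.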
Compiler Options
Using C99's inline hints relies on you telling the compiler which inlining behaviour you are looking for.
GCC for instance has:
-fno-inline, -finline-small-functions, -findirect-inlining, -finline-functions, -finline-functions-called-once, -fearly-inlining, -finline-limit=n
Microsoft compiler also has options that dictate the effectiveness of inline. Some compilers will also allow optimization to take into account running profile.
I do think it's worth seeing inlining in the broader context of program optimization.
Preventing Inlining
You mention that you don't want certain functions inlined. This might be done by setting something like __attribute__((always_inline)) without turning on the optimizer. However, you would probably want the optimizer. One option here would be to hint you don't want it: __attribute__ ((noinline)). But why would this be the case?
Other forms of optimization
You may also consider how you might restructure your loop and avoiding branches. Branch prediction can have a dramatic effect. For an interesting discussion on this see: Why is it faster to process a sorted array than an unsorted array?
Then you might also want smaller inner loops to be unrolled, and to look at loop invariants.
There's a kernel source that uses #defines in a very interesting way to define several different named functions with the same body. This solves the problem of having two different functions to maintain. (I forgot which one it was...). My idea is based on this same principle.
The way to use the defines is that you'll define the inline function on the compilation unit you need it. To demonstrate the method I'll use a simple function:
int add(int a, int b);
It works like this: you make a function generator #define in a header file and declare the function prototype of the normal version of the function (the one not inlined).
Then you declare two separate function generators, one for the normal function and one for the inline function. The inline function you declare as static __inline__. When you need to call the inline function in one of your files, you use the generator define to get the source for it. In all other files you need to use the normal function, you just include the header with the prototype.
The code was tested on:
Intel(R) Core(TM) i5-3330 CPU @ 3.00GHz
Kernel Version: 3.16.0-49-generic
GCC 4.8.4
Code is worth more than a thousand words, so:
File Hierarchy
+
| Makefile
| add.h
| add.c
| loop.c
| loop2.c
| loop3.c
| loops.h
| main.c
add.h
#define GENERATE_ADD(type, prefix) \
type int prefix##add(int a, int b) { return a + b; }
#define DEFINE_ADD() GENERATE_ADD(,)
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__, inline_)
int add(int, int);
This doesn't look nice, but cuts the work of maintaining two different functions. The function is fully defined within the GENERATE_ADD(type,prefix) macro, so if you ever need to change the function, you change this macro and everything else changes.
Next, DEFINE_ADD() will be called from add.c to generate the normal version of add. DEFINE_INLINE_ADD() will give you access to a function called inline_add, which has the same signature as your normal add function, but it has a different name (the inline_ prefix).
Note: I didn't need __attribute__((always_inline)) when using the -O3 flag; __inline__ did the job. However, if you don't want to use -O3, use:
#define DEFINE_INLINE_ADD() GENERATE_ADD(static __inline__ __attribute__((always_inline)), inline_)
add.c
#include "add.h"
DEFINE_ADD()
Simple call to the DEFINE_ADD() macro generator. This will define the normal version of the function (the one that won't get inlined).
loop.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
Here in loop.c you can see the call to DEFINE_INLINE_ADD(). This gives this file access to the inline_add function. When you compile, every call to inline_add will be inlined.
loop2.c
#include <stdio.h>
#include "add.h"
int loop2(void)
{
register int i;
for (i = 0; i < 100000; i++)
printf("%d\n", add(i + 1, i + 2));
return 0;
}
This is to show you can use the normal version of add normally from other files.
loop3.c
#include <stdio.h>
#include "add.h"
DEFINE_INLINE_ADD()
int loop3(void)
{
register int i;
printf ("add: %d\n", add(2,3));
printf ("add: %d\n", add(4,5));
for (i = 0; i < 100000; i++)
printf("%d\n", inline_add(i + 1, i + 2));
return 0;
}
This is to show that you can use both functions in the same compilation unit, yet one of them will be inlined and the other won't (see the GDB disassembly below for details).
loops.h
/* prototypes for main */
int loop (void);
int loop2 (void);
int loop3 (void);
main.c
#include <stdio.h>
#include <stdlib.h>
#include "add.h"
#include "loops.h"
int main(void)
{
printf("%d\n", add(1,2));
printf("%d\n", add(2,3));
loop();
loop2();
loop3();
return 0;
}
Makefile
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $@ $^ ${CFLAGS}
add.o: add.c
${CC} -c $^ ${CFLAGS}
loop.o: loop.c
${CC} -c $^ -O3 ${CFLAGS}
loop2.o: loop2.c
${CC} -c $^ ${CFLAGS}
loop3.o: loop3.c
${CC} -c $^ -O3 ${CFLAGS}
If you use the __attribute__((always_inline)) you can change the Makefile to:
CC=gcc
CFLAGS=-Wall -pedantic --std=c11
main: add.o loop.o loop2.o loop3.o main.o
${CC} -o $@ $^ ${CFLAGS}
%.o: %.c
${CC} -c $^ ${CFLAGS}
Compilation
$ make
gcc -c add.c -Wall -pedantic --std=c11
gcc -c loop.c -O3 -Wall -pedantic --std=c11
gcc -c loop2.c -Wall -pedantic --std=c11
gcc -c loop3.c -O3 -Wall -pedantic --std=c11
gcc -Wall -pedantic --std=c11 -c -o main.o main.c
gcc -o main add.o loop.o loop2.o loop3.o main.o -Wall -pedantic --std=c11
Disassembly
$ gdb main
(gdb) disass add
0x000000000040059d <+0>: push %rbp
0x000000000040059e <+1>: mov %rsp,%rbp
0x00000000004005a1 <+4>: mov %edi,-0x4(%rbp)
0x00000000004005a4 <+7>: mov %esi,-0x8(%rbp)
0x00000000004005a7 <+10>:mov -0x8(%rbp),%eax
0x00000000004005aa <+13>:mov -0x4(%rbp),%edx
0x00000000004005ad <+16>:add %edx,%eax
0x00000000004005af <+18>:pop %rbp
0x00000000004005b0 <+19>:retq
(gdb) disass loop
0x00000000004005c0 <+0>: push %rbx
0x00000000004005c1 <+1>: mov $0x3,%ebx
0x00000000004005c6 <+6>: nopw %cs:0x0(%rax,%rax,1)
0x00000000004005d0 <+16>:mov %ebx,%edx
0x00000000004005d2 <+18>:xor %eax,%eax
0x00000000004005d4 <+20>:mov $0x40079d,%esi
0x00000000004005d9 <+25>:mov $0x1,%edi
0x00000000004005de <+30>:add $0x2,%ebx
0x00000000004005e1 <+33>:callq 0x4004a0 <__printf_chk@plt>
0x00000000004005e6 <+38>:cmp $0x30d43,%ebx
0x00000000004005ec <+44>:jne 0x4005d0 <loop+16>
0x00000000004005ee <+46>:xor %eax,%eax
0x00000000004005f0 <+48>:pop %rbx
0x00000000004005f1 <+49>:retq
(gdb) disass loop2
0x00000000004005f2 <+0>: push %rbp
0x00000000004005f3 <+1>: mov %rsp,%rbp
0x00000000004005f6 <+4>: push %rbx
0x00000000004005f7 <+5>: sub $0x8,%rsp
0x00000000004005fb <+9>: mov $0x0,%ebx
0x0000000000400600 <+14>:jmp 0x400625 <loop2+51>
0x0000000000400602 <+16>:lea 0x2(%rbx),%edx
0x0000000000400605 <+19>:lea 0x1(%rbx),%eax
0x0000000000400608 <+22>:mov %edx,%esi
0x000000000040060a <+24>:mov %eax,%edi
0x000000000040060c <+26>:callq 0x40059d <add>
0x0000000000400611 <+31>:mov %eax,%esi
0x0000000000400613 <+33>:mov $0x400794,%edi
0x0000000000400618 <+38>:mov $0x0,%eax
0x000000000040061d <+43>:callq 0x400470 <printf@plt>
0x0000000000400622 <+48>:add $0x1,%ebx
0x0000000000400625 <+51>:cmp $0x1869f,%ebx
0x000000000040062b <+57>:jle 0x400602 <loop2+16>
0x000000000040062d <+59>:mov $0x0,%eax
0x0000000000400632 <+64>:add $0x8,%rsp
0x0000000000400636 <+68>:pop %rbx
0x0000000000400637 <+69>:pop %rbp
0x0000000000400638 <+70>:retq
(gdb) disass loop3
0x0000000000400640 <+0>: push %rbx
0x0000000000400641 <+1>: mov $0x3,%esi
0x0000000000400646 <+6>: mov $0x2,%edi
0x000000000040064b <+11>:mov $0x3,%ebx
0x0000000000400650 <+16>:callq 0x40059d <add>
0x0000000000400655 <+21>:mov $0x400798,%esi
0x000000000040065a <+26>:mov %eax,%edx
0x000000000040065c <+28>:mov $0x1,%edi
0x0000000000400661 <+33>:xor %eax,%eax
0x0000000000400663 <+35>:callq 0x4004a0 <__printf_chk@plt>
0x0000000000400668 <+40>:mov $0x5,%esi
0x000000000040066d <+45>:mov $0x4,%edi
0x0000000000400672 <+50>:callq 0x40059d <add>
0x0000000000400677 <+55>:mov $0x400798,%esi
0x000000000040067c <+60>:mov %eax,%edx
0x000000000040067e <+62>:mov $0x1,%edi
0x0000000000400683 <+67>:xor %eax,%eax
0x0000000000400685 <+69>:callq 0x4004a0 <__printf_chk@plt>
0x000000000040068a <+74>:nopw 0x0(%rax,%rax,1)
0x0000000000400690 <+80>:mov %ebx,%edx
0x0000000000400692 <+82>:xor %eax,%eax
0x0000000000400694 <+84>:mov $0x40079d,%esi
0x0000000000400699 <+89>:mov $0x1,%edi
0x000000000040069e <+94>:add $0x2,%ebx
0x00000000004006a1 <+97>:callq 0x4004a0 <__printf_chk@plt>
0x00000000004006a6 <+102>:cmp $0x30d43,%ebx
0x00000000004006ac <+108>:jne 0x400690 <loop3+80>
0x00000000004006ae <+110>:xor %eax,%eax
0x00000000004006b0 <+112>:pop %rbx
0x00000000004006b1 <+113>:retq
Symbol table
$ objdump -t main | grep add
0000000000000000 l df *ABS* 0000000000000000 add.c
000000000040059d g F .text 0000000000000014 add
$ objdump -t main | grep loop
0000000000000000 l df *ABS* 0000000000000000 loop.c
0000000000000000 l df *ABS* 0000000000000000 loop2.c
0000000000000000 l df *ABS* 0000000000000000 loop3.c
00000000004005c0 g F .text 0000000000000032 loop
00000000004005f2 g F .text 0000000000000047 loop2
0000000000400640 g F .text 0000000000000072 loop3
$ objdump -t main | grep main
main: file format elf64-x86-64
0000000000000000 l df *ABS* 0000000000000000 main.c
0000000000000000 F *UND* 0000000000000000 __libc_start_main@@GLIBC_2.2.5
00000000004006b2 g F .text 000000000000005a main
$ objdump -t main | grep inline
$
Well, that's it. After 3 hours of banging my head on the keyboard trying to figure it out, this was the best I could come up with. Feel free to point out any errors; I'd really appreciate it. I got really interested in this problem of inlining one particular function call.
If you do not mind having two names for the same function, you could create a small wrapper around your function to "block" the always_inline attribute from affecting every call. In my example, loop_inlined would be the name you would use in performance-critical sections, while the plain loop would be used everywhere else.
inline.h
#include <stdlib.h>
static inline int loop_inlined() __attribute__((always_inline));
int loop();
static inline int loop_inlined() {
int n = 0, i;
for(i = 0; i < 10000; i++)
n += rand();
return n;
}
inline.c
#include "inline.h"
int loop() {
return loop_inlined();
}
main.c
#include "inline.h"
#include <stdio.h>
int main(int argc, char *argv[]) {
printf("%d\n", loop_inlined());
printf("%d\n", loop());
return 0;
}
This works regardless of the optimization level. Compiling with gcc inline.c main.c on Intel gives:
4011e6: c7 44 24 18 00 00 00 movl $0x0,0x18(%esp)
4011ed: 00
4011ee: eb 0e jmp 4011fe <_main+0x2e>
4011f0: e8 5b 00 00 00 call 401250 <_rand>
4011f5: 01 44 24 1c add %eax,0x1c(%esp)
4011f9: 83 44 24 18 01 addl $0x1,0x18(%esp)
4011fe: 81 7c 24 18 0f 27 00 cmpl $0x270f,0x18(%esp)
401205: 00
401206: 7e e8 jle 4011f0 <_main+0x20>
401208: 8b 44 24 1c mov 0x1c(%esp),%eax
40120c: 89 44 24 04 mov %eax,0x4(%esp)
401210: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
401217: e8 2c 00 00 00 call 401248 <_printf>
40121c: e8 7f ff ff ff call 4011a0 <_loop>
401221: 89 44 24 04 mov %eax,0x4(%esp)
401225: c7 04 24 60 30 40 00 movl $0x403060,(%esp)
40122c: e8 17 00 00 00 call 401248 <_printf>
The first 7 instructions are the inlined call, and the regular call happens 5 instructions later.
Here's a suggestion: write the body of the function in a separate header file.
Include that header directly at the place where the code has to be inline, and include it inside a function body in a C file for all other calls.
void demo(void)
{
#include "myBody.h"
}
void importantloop(void)
{
// code
#include "myBody.h"
// code
}
I assume that your function is a little one since you want to inline it, if so why don't you write it in asm?
As for inlining only a specific call to a function I don't think there exists something to do this task for you. Once a function is declared as inline and if the compiler will inline it for you it will do it everywhere it sees a call to that function.
I want to override certain function calls to various APIs for the sake of logging the calls, but I also might want to manipulate data before it is sent to the actual function.
For example, say I use a function called getObjectName thousands of times in my source code. I want to temporarily override this function sometimes because I want to change the behaviour of this function to see the different result.
I create a new source file like this:
#include <apiheader.h>
const char *getObjectName (object *anObject)
{
if (anObject == NULL)
return "(null)";
else
return "name should be here";
}
I compile all my other source as I normally would, but I link it against this function first before linking with the API's library. This works fine except I can obviously not call the real function inside my overriding function.
Is there an easier way to "override" a function without getting linking/compiling errors/warnings? Ideally I want to be able to override the function by just compiling and linking an extra file or two rather than fiddle around with linking options or altering the actual source code of my program.
With gcc, under Linux you can use the --wrap linker flag like this:
gcc program.c -Wl,-wrap,getObjectName -o program
and define your function as:
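const char *__real_getObjectName (object *anObject); // resolved by the linker to the original function
const char *__wrap_getObjectName (object *anObject)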
{
if (anObject == NULL)
return "(null)";
else
return __real_getObjectName( anObject ); // call the real function
}
This will ensure that all calls to getObjectName() are rerouted to your wrapper function (at link time). This very useful flag is however absent in gcc under Mac OS X.
Remember to declare the wrapper function with extern "C" if you're compiling with g++ though.
If it's only for your source that you want to capture/modify the calls, the simplest solution is to put together a header file (intercept.h) with:
#ifdef INTERCEPT
#define getObjectName(x) myGetObjectName(x)
#endif
Then you implement the function as follows (in intercept.c which doesn't include intercept.h):
const char *myGetObjectName (object *anObject) {
if (anObject == NULL) return "(null)";
return getObjectName(anObject);
}
Then make sure each source file where you want to intercept the call has the following at the top:
#include "intercept.h"
When you compile with "-DINTERCEPT", all files will call your function rather than the real one, whereas your function will still call the real one.
Compiling without the "-DINTERCEPT" will prevent interception from occurring.
It's a bit trickier if you want to intercept all calls (not just those from your source). This can generally be done with dynamic loading and resolution of the real function (with dlopen- and dlsym-type calls), but I don't think it's necessary in your case.
You can override a function using the LD_PRELOAD trick; see man ld.so. You compile a shared library containing your function and start the binary (you don't even need to modify the binary!) like LD_PRELOAD=mylib.so myprog.
In the body of your function (in shared lib) you write like this:
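#define _GNU_SOURCE // needed for RTLD_NEXT with glibc
#include <dlfcn.h>
#include <stdio.h>
const char *getObjectName (object *anObject) {
static const char * (*func)(object *);
if(!func)
func = (const char *(*)(object *)) dlsym(RTLD_NEXT, "getObjectName"); // look up the next definition
printf("Overridden!\n");
return(func(anObject)); // call original function
}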
You can override any function from shared library, even from stdlib, without modifying/recompiling the program, so you could do the trick on programs you don't have a source for. Isn't it nice?
If you use GCC, you can make your function weak. Those can be overridden by non-weak functions:
test.c:
#include <stdio.h>
__attribute__((weak)) void test(void) {
printf("not overridden!\n");
}
int main() {
test();
}
What does it do?
$ gcc test.c
$ ./a.out
not overridden!
test1.c:
#include <stdio.h>
void test(void) {
printf("overridden!\n");
}
What does it do?
$ gcc test1.c test.c
$ ./a.out
overridden!
Sadly, that won't work for other compilers. But you can have the weak declarations that contain overridable functions in their own file, placing just an include into the API implementation files if you are compiling using GCC:
weakdecls.h:
__attribute__((weak)) void test(void);
... other weak function declarations ...
functions.c:
/* for GCC, these will become weak definitions */
#ifdef __GNUC__
#include "weakdecls.h"
#endif
void test(void) {
...
}
... other functions ...
Downside of this is that it does not work entirely without doing something to the api files (needing those three lines and the weakdecls). But once you did that change, functions can be overridden easily by writing a global definition in one file and linking that in.
You can define a function pointer as a global variable. The callers' syntax would not change. When your program starts, it could check if some command-line flag or environment variable is set to enable logging, then save the function pointer's original value and replace it with your logging function. You would not need a special "logging enabled" build. Users could enable logging "in the field".
You will need to be able to modify the callers' source code, but not the callee (so this would work when calling third-party libraries).
foo.h:
typedef const char* (*GetObjectNameFuncPtr)(object *anObject);
extern GetObjectNameFuncPtr GetObjectName;
foo.cpp:
const char* GetObjectName_real(object *anObject)
{
return "object name";
}
const char* GetObjectName_logging(object *anObject)
{
if (anObject == NULL)
return "(null)";
else
return GetObjectName_real(anObject);
}
GetObjectNameFuncPtr GetObjectName = GetObjectName_real;
int main()
{
GetObjectName(NULL); // calls GetObjectName_real();
if (isLoggingEnabled)
GetObjectName = GetObjectName_logging;
GetObjectName(NULL); // calls GetObjectName_logging();
}
Building on @Johannes Schaub's answer with a solution suitable for code you don't own.
Alias the function you want to override to a weakly-defined function, and then reimplement it yourself.
override.h
#define foo(x) __attribute__((weak))foo(x)
foo.c
int foo() { return 1234; }
override.c
int foo() { return 5678; }
Use pattern-specific variable values in your Makefile to add the compiler flag -include override.h.
%foo.o: ALL_CFLAGS += -include override.h
Aside: Perhaps you could also use -D 'foo(x) __attribute__((weak))foo(x)' to define your macros.
Compile and link the file with your reimplementation (override.c).
This allows you to override a single function from any source file, without having to modify the code.
The downside is that you must use a separate header file for each file you want to override.
There's also a tricky method of doing it in the linker involving two stub libraries.
Library #1 is linked against the host library and exposes the symbol being redefined under another name.
Library #2 is linked against library #1, intercepting the call and calling the redefined version in library #1.
Be very careful with link orders here or it won't work.
Below are my experiments. There are 4 conclusions, in the body and at the end.
Short Version
Generally speaking, to successfully override a function, you have to consider:
weak attribute
translation unit arrangement
Long Version
I have these source files.
.
├── decl.h
├── func3.c
├── main.c
├── Makefile1
├── Makefile2
├── override.c
├── test_target.c
└── weak_decl.h
main.c
#include <stdio.h>
int main (void)
{
func1();
}
test_target.c
#include <stdio.h>
void func3(void);
void func2 (void)
{
printf("in original func2()\n");
}
void func1 (void)
{
printf("in original func1()\n");
func2();
func3();
}
func3.c
#include <stdio.h>
void func3 (void)
{
printf("in original func3()\n");
}
decl.h
void func1 (void);
void func2 (void);
void func3 (void);
weak_decl.h
void func1 (void);
__attribute__((weak))
void func2 (void);
__attribute__((weak))
void func3 (void);
override.c
#include <stdio.h>
void func2 (void)
{
printf("in mock func2()\n");
}
void func3 (void)
{
printf("in mock func3()\n");
}
Makefile1:
ALL:
rm -f *.o *.a
gcc -c override.c -o override.o
gcc -c func3.c -o func3.o
gcc -c test_target.c -o test_target_weak.o -include weak_decl.h
ar cr all_weak.a test_target_weak.o func3.o
gcc main.c all_weak.a override.o -o main -include decl.h
Makefile2:
ALL:
rm -f *.o *.a
gcc -c override.c -o override.o
gcc -c func3.c -o func3.o
gcc -c test_target.c -o test_target_strong.o -include decl.h # HERE -include differs!!
ar cr all_strong.a test_target_strong.o func3.o
gcc main.c all_strong.a override.o -o main -include decl.h
Output for Makefile1 result:
in original func1()
in mock func2()
in mock func3()
Output for Makefile2:
rm *.o *.a
gcc -c override.c -o override.o
gcc -c func3.c -o func3.o
gcc -c test_target.c -o test_target_strong.o -include decl.h # -include differs!!
ar cr all_strong.a test_target_strong.o func3.o
gcc main.c all_strong.a override.o -o main -include decl.h
override.o: In function `func2':
override.c:(.text+0x0): multiple definition of `func2' <===== HERE!!!
all_strong.a(test_target_strong.o):test_target.c:(.text+0x0): first defined here
override.o: In function `func3':
override.c:(.text+0x13): multiple definition of `func3' <===== HERE!!!
all_strong.a(func3.o):func3.c:(.text+0x0): first defined here
collect2: error: ld returned 1 exit status
Makefile4:2: recipe for target 'ALL' failed
make: *** [ALL] Error 1
The symbol table:
all_weak.a:
test_target_weak.o:
0000000000000013 T func1 <=== 13 is the offset of func1 in test_target_weak.o, see below disassembly
0000000000000000 W func2 <=== func2 is [W]eak symbol with default value assigned
w func3 <=== func3 is [w]eak symbol without default value
U _GLOBAL_OFFSET_TABLE_
U puts
func3.o:
0000000000000000 T func3 <==== func3 is a strong symbol
U _GLOBAL_OFFSET_TABLE_
U puts
all_strong.a:
test_target_strong.o:
0000000000000013 T func1
0000000000000000 T func2 <=== func2 is strong symbol
U func3 <=== func3 is undefined symbol, there's no address value on the left-most column because func3 is not defined in test_target_strong.c
U _GLOBAL_OFFSET_TABLE_
U puts
func3.o:
0000000000000000 T func3 <=== func3 is strong symbol
U _GLOBAL_OFFSET_TABLE_
U puts
In both cases, the override.o symbols:
0000000000000000 T func2 <=== func2 is strong symbol
0000000000000013 T func3 <=== func3 is strong symbol
U _GLOBAL_OFFSET_TABLE_
U puts
disassembly:
test_target_weak.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <func2>: <===== HERE func2 offset is 0
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # b <func2+0xb>
b: e8 00 00 00 00 callq 10 <func2+0x10>
10: 90 nop
11: 5d pop %rbp
12: c3 retq
0000000000000013 <func1>: <====== HERE func1 offset is 13
13: 55 push %rbp
14: 48 89 e5 mov %rsp,%rbp
17: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 1e <func1+0xb>
1e: e8 00 00 00 00 callq 23 <func1+0x10>
23: e8 00 00 00 00 callq 28 <func1+0x15>
28: e8 00 00 00 00 callq 2d <func1+0x1a>
2d: 90 nop
2e: 5d pop %rbp
2f: c3 retq
So the conclusion is:
A function defined in a .o file can override the same function defined in a .a file. In Makefile1 above, func2() and func3() in override.o override their counterparts in all_weak.a. I tried with both as .o files, but it doesn't work.
For GCC, you don't need to split the functions into separate .o files, as described here for the Visual Studio toolchain. In the example above, both func2() (in the same file as func1()) and func3() (in a separate file) can be overridden.
To override a function, you need to declare it as weak when compiling its consumer's translation unit. That records the function as weak in the consumer's .o file. In the example above, when compiling test_target.c, which consumes func2() and func3(), you need to add -include weak_decl.h, which declares func2() and func3() as weak. func2() is also defined in test_target.c, but that's OK.
Some further experiment
Still with the above source files. But change the override.c a bit:
override.c
#include <stdio.h>
void func2 (void)
{
printf("in mock func2()\n");
}
// void func3 (void)
// {
// printf("in mock func3()\n");
// }
Here I removed the override version of func3(). I did this because I want to fall back to the original func3() implementation in the func3.c.
I still use Makefile1 to build. The build is OK. But a runtime error happens as below:
xxx@xxx-host:~/source/override$ ./main
in original func1()
in mock func2()
Segmentation fault (core dumped)
So I checked the symbols of the final main:
0000000000000696 T func1
00000000000006b3 T func2
w func3
So we can see that func3 has no valid address. That's why the segmentation fault happens.
So why? Didn't I add the func3.o into the all_weak.a archive file?
ar cr all_weak.a func3.o test_target_weak.o
I tried the same thing with func2, where I removed the func2 implementation from override.c. But this time there's no segmentation fault.
override.c
#include <stdio.h>
// void func2 (void)
// {
// printf("in mock func2()\n");
// }
void func3 (void)
{
printf("in mock func3()\n");
}
Output:
xxx@xxx-host:~/source/override$ ./main
in original func1()
in original func2() <====== the original func2() is invoked as a fall back
in mock func3()
My guess is that this works because func2 is defined in the same file/translation unit as func1, so func2 is always brought in together with func1. The linker can therefore always resolve func2, be it from test_target.c or from override.c.
func3, however, is defined in a separate file/translation unit (func3.c). If it is declared as weak, the consumer test_target.o will still record func3() as weak. But unfortunately the GCC linker will not check the other .o files in the same .a file to look for an implementation of func3(), even though it is indeed there.
all_weak.a:
func3.o:
0000000000000000 T func3 <========= func3 is indeed here!
U _GLOBAL_OFFSET_TABLE_
U puts
test_target_weak.o:
0000000000000013 T func1
0000000000000000 W func2
w func3
U _GLOBAL_OFFSET_TABLE_
U puts
So I must provide an override version in override.c otherwise the func3() cannot be resolved.
But I still don't know why GCC behaves like this. If someone can explain, please.
(Update 9:01 AM 8/8/2021:
this thread may explain this behavior, hopefully.)
So further conclusion is:
If you declare some symbol as weak, you'd better provide override versions of all the weak functions. Otherwise, the original version cannot be resolved unless it lives within the same file/translation unit of the caller/consumer.
You could use a shared library (Unix) or a DLL (Windows) to do this as well (would be a bit of a performance penalty). You can then change the DLL/so that gets loaded (one version for debug, one version for non-debug).
I have done a similar thing in the past (not to achieve what you are trying to achieve, but the basic premise is the same) and it worked out well.
[Edit based on OP comment]
"In fact one of the reasons I want to override functions is because I suspect they behave differently on different operating systems."
There are two common ways (that I know of) of dealing with that, the shared lib/dll way or writing different implementations that you link against.
For both solutions (shared libs or different linking) you would have foo_linux.c, foo_osx.c, foo_win32.c (or a better way is linux/foo.c, osx/foo.c and win32/foo.c) and then compile and link with the appropriate one.
If you are looking for both different code for different platforms AND debug -vs- release I would probably be inclined to go with the shared lib/DLL solution as it is the most flexible.