CPU dependent code: how to avoid function pointers?

I have performance-critical code written for multiple CPUs. I detect the CPU at run time and, based on that, select the appropriate function for the detected CPU. So now I have to use function pointers and call the functions through them:
void do_something_neon(void);
void do_something_armv6(void);

void (*do_something)(void);

if (cpu == NEON) {
    do_something = do_something_neon;
} else {
    do_something = do_something_armv6;
}

// Use the function pointer:
do_something();
...
Not that it matters, but I'll mention that I have optimized functions for different CPUs: ARMv6, and ARMv7 with NEON support. The problem is that by using function pointers in many places the code becomes slower, and I'd like to avoid that.
Basically, at load time the linker resolves relocations and patches code with function addresses. Is there a way to control that behavior better?
Personally, I'd propose two different ways to avoid function pointers: create two separate .so files (or DLLs) for the CPU-dependent functions, place them in different folders, and based on the detected CPU add one of these folders to the search path (or LD_LIBRARY_PATH). Then load the main code, and the dynamic linker will pick up the required library from the search path. The other way is to compile two separate copies of the library :)
The drawback of the first method is that it forces me to have at least 3 shared objects (DLLs): two for the CPU-dependent functions and one for the main code that uses them. I need 3 because I have to be able to do CPU detection before loading the code that uses these CPU-dependent functions. The good part about the first method is that the app won't need to load multiple copies of the same code for multiple CPUs; it will load only the copy that will actually be used. The drawback of the second method is quite obvious, no need to talk about it.
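As a hedged sketch of the first method (all paths and the cpu_has_neon() helper are hypothetical placeholders), a tiny launcher could detect the CPU, point the dynamic linker at the matching directory, and exec the real binary; the exec is needed because changing LD_LIBRARY_PATH does not affect the dynamic linker of an already-running process:

/* Hypothetical launcher: choose the library directory before the real
 * binary (and its CPU-dependent .so) is loaded. */
#include <stdlib.h>
#include <unistd.h>

extern int cpu_has_neon(void); /* assumed runtime CPU detection helper */

int main(int argc, char **argv)
{
    (void)argc;
    setenv("LD_LIBRARY_PATH",
           cpu_has_neon() ? "/opt/app/lib/neon" : "/opt/app/lib/armv6",
           1 /* overwrite any existing value */);
    /* exec so the fresh process's dynamic linker sees the new path */
    execv("/opt/app/bin/app-main", argv);
    return 1; /* reached only if execv failed */
}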
I'd like to know if there is a way to do that without using shared objects and manually loading them at runtime. One way would be some hackery that involves patching code at run time, but that's probably too complicated to get done properly. Is there a better way to control relocations at load time? Maybe place the CPU-dependent functions in different sections and then somehow specify which section has priority? I think Mac's Mach-O format has something like that.
An ELF-only solution (for an ARM target) is enough for me; I don't really care about PE (DLLs).
thanks

You may want to look up the GNU dynamic linker extension STT_GNU_IFUNC. From Drepper's blog when it was added:
Therefore I’ve designed an ELF extension which allows to make the decision about which implementation to use once per process run. It is implemented using a new ELF symbol type (STT_GNU_IFUNC). Whenever a symbol lookup resolves to a symbol with this type the dynamic linker does not immediately return the found value. Instead it interprets the value as a function pointer to a function that takes no argument and returns the real function pointer to use. The code called can be under control of the implementer and can choose, based on whatever information the implementer wants to use, which of the two or more implementations to use.
Source: http://udrepper.livejournal.com/20948.html
Nonetheless, as others have said, I think you're mistaken about the performance impact of indirect calls. All code in shared libraries is already called via a (hidden) function pointer in the GOT and a PLT entry that loads/calls that function pointer.

For the best performance you need to minimize the number of indirect calls (through pointers) per second and allow the compiler to optimize your code better (DLLs hamper this because there must be a clear boundary between a DLL and the main executable, and there's no optimization across this boundary).
I'd suggest doing these:
moving as much as possible of the main executable's code that frequently calls DLL functions into the DLL itself. That'll minimize the number of indirect calls per second and allow for better optimization at compile time too.
moving almost all your code into separate CPU-specific DLLs and leaving main() only the job of loading the proper DLL, OR building CPU-specific executables without DLLs.

Here's the exact answer that I was looking for.
GCC's __attribute__((ifunc("resolver")))
It requires fairly recent binutils.
There's a good article that describes this extension: Gnu support for CPU dispatching - sort of...
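A minimal sketch of how this looks in practice (cpu_has_neon() stands in for whatever runtime detection you already have, and the variant names are made up; this follows the pattern from the GCC documentation):

static void do_something_neon(void)  { /* NEON implementation */ }
static void do_something_armv6(void) { /* ARMv6 implementation */ }

extern int cpu_has_neon(void); /* assumed detection helper */

/* The resolver runs once, when the dynamic linker resolves the
 * do_something symbol; afterwards every call goes straight to the
 * chosen variant, with no per-call branch in your code. Beware: the
 * resolver runs very early, so keep it self-contained. */
static void (*resolve_do_something(void))(void)
{
    return cpu_has_neon() ? do_something_neon : do_something_armv6;
}

void do_something(void) __attribute__((ifunc("resolve_do_something")));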

Lazy loading ELF symbols from shared libraries is described in section 1.5.5 of Ulrich Drepper's DSO How To (updated 2011-12-10). For ARM it is described in section 3.1.3 of ELF for ARM.
EDIT: Regarding the STT_GNU_IFUNC extension mentioned by R.: I forgot that it was an extension. GNU Binutils has supported it for ARM since March 2011, apparently, according to the changelog.
If you want to call functions without the indirection of the PLT, I suggest either function pointers or per-arch shared libraries inside which function calls don't go through the PLT (beware: a call to an exported function goes through the PLT).
I wouldn't patch the code at runtime. I mean, you can. You could add a build step: after compilation, disassemble your binaries, find all offsets of calls to functions that have multi-arch alternatives, build a table of patch locations, and link that into your code. In main, remap the text segment writable, patch the offsets according to the table you prepared, map it back to read-only, flush the instruction cache, and proceed. I'm sure it can be made to work. But how much performance do you expect to gain from this approach? Loading different shared libraries at runtime is easier, and function pointers are easier still.

Related

How to circumvent dlopen() caching?

According to its man page, dlopen() will not load the same library twice:
If the same shared object is loaded again with dlopen(), the same object handle is returned. The dynamic linker maintains reference counts for object handles, so a dynamically loaded shared object is not deallocated until dlclose() has been called on it as many times as dlopen() has succeeded on it. Any initialization routines (see below) are called just once. However, a subsequent dlopen() call that loads the same shared object with RTLD_NOW may force symbol resolution for a shared object earlier loaded with RTLD_LAZY.
(emphasis mine).
But what actually determines the identity of shared objects? I tried to look into the code but did not get very far. Is it:
some form of normalized path name (e.g. realpath?)
the inode?
the contents of the library?
I am pretty sure that I can rule out the last point, since an actual filesystem copy yields two different handles.
To explain the motivation behind this question: I am working with some code that has static global variables. I need multiple instances of that code to run in a thread-safe manner. My current approach is to compile and link said code into a dynamic library and load that library multiple times. With some linker magic, this appears to create several copies of the globals and to resolve accesses in each library to its own copies. The only problem is that my prototype copies the generated library n times for n concurrent uses. This is not only somewhat ugly, but I also suspect that it might break on a different platform.
So what is the exact behaviour of dlopen() according to the POSIX standard?
edit: Because it came up in a comment and an answer: no, refactoring the code is definitely not an option. It would involve months or even years of work and potentially sacrifice all the benefits of using the code in the first place. There is an ongoing research project that might solve this problem in a much cleaner way, but it is actual research and might fail. I need a solution now.
edit2: Because people still seem to not believe the use case is actually valid: I am working on a pure functional language that shall be embedded into a larger C/C++ application. Because I needed a prototype with a garbage collector, a proven typechecker, and reasonable performance ASAP, I used OCaml as intermediate code. Right now, I compile a source module into an OCaml module, link the generated object code (including startup etc.) into a shared library with the OCaml runtime, and dlopen() that shared library. Every .so has its own copy of the runtime, including several global variables (e.g. the pointer to the young generation), and that is, or rather should be, totally fine. The library exposes exactly two functions: an initializer and a single export that does whatever the original module is intended to do. No symbols of the OCaml runtime are exported/shared. When I load the library, its internal symbols are relocated as expected; the only issue I have right now is that I actually need to copy the .so file for each instance of the job at runtime.
Regarding thread-local storage: that is actually an interesting idea, as the modification to the runtime is indeed rather simple. But the problem is the machine code generated by the OCaml compiler, as it cannot emit loading instructions for TLS symbols (yet?).
POSIX says:
Only a single copy of an object file is brought into the address space, even if dlopen() is invoked multiple times in reference to the file, and even if different pathnames are used to reference the file.
So the answer is: the inode. Copying the library file "should work", but hard links won't. Except that even the copies will expose the same global symbols, and when that happens all (portability) bets are off. You're in the middle of weakly defined behavior that has evolved through bug fixes rather than good design.
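A quick way to observe this behavior (libfoo.so and its filesystem copy are placeholder names; build the demo with gcc demo.c -ldl):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *a = dlopen("./libfoo.so", RTLD_NOW);
    void *b = dlopen("./libfoo.so", RTLD_NOW);      /* same file, same inode */
    void *c = dlopen("./libfoo_copy.so", RTLD_NOW); /* filesystem copy */

    printf("a == b: %d\n", a == b); /* 1: the reference count was bumped */
    printf("a == c: %d\n", a == c); /* 0: the copy is a distinct object */
    return 0;
}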
Don't dig deeper when you're in a hole. Adding further horrible hacks to make a fundamentally broken library work just leads to more breakage. Just spend a few hours fixing the library not to use globals, instead of spending days hacking around the dynamic linker (which will be unportable at best).

How can I "dump" a function to a file?

For example, I have a function func():
int func(int a, int b) { return a + b; }
Now I want to write it to a file so that I can use the mmap system call to load it with PROT_EXEC and call it from another program. What should I do?
If you know the signature you need, and you have a static library or know the location of a shared library at compile time, you probably just want to include the header and link against the library. If you want to invoke a function dynamically, you probably want dlopen / dlsym (UNIX) or LoadLibrary / GetProcAddress (Windows) to load the library dynamically and retrieve the address of the function by name.
Note that the cases where you actually need to load a library dynamically (at least explicitly) are pretty rare. This is often used for modular architectures (e.g. "plugins" or "extensions") where individual pieces of the application are distributed separately (which can be achieved more securely using IPC rather than dynamic loading... see my note below). Or for cases where your application is not allowed to include dependencies statically and needs to conditionally supply behavior based on the existence of certain library dependencies in the environment in which it happens to be executing. In most cases, though, you'll simply want to include a header that declares the symbols you need and compile for each target platform (possibly using #if...#else macros if there are symbols that vary across OSes or OS versions).
From a stability, security, and code-complexity standpoint, I personally recommend that you avoid dynamic library loading. For core system functionality, it's reasonable to link against a dynamic library, but you'll want to do it in a way where the burden of dynamic loading is entirely on your toolchain (i.e. you shouldn't need to call dlopen or LoadLibrary explicitly). For other functionality, it is almost always better to link statically (assuming you distribute updates when there are security fixes for your dependencies), since this avoids breakage from incompatible version updates and spares your users dependency hell (you require version A but some other application requires version B). Modular architectures are often better (and more securely) achieved through inter-process communication (IPC): dynamically loaded libraries live in the process of the program that loads them (thereby gaining access to the entire process's virtual memory space), whereas with IPC each component is a separate process with access only to the information explicitly given to it by the calling process, which makes it harder for a malicious component to steal data from the caller or other components, or to cause instability.
The sanest thing if you want this to actually be used in the real world is probably to just compile the source as part of your program on each platform, like a regular function.
Next best is probably a separate process that you talk to rather than merge with.
Semi-sane (but still not a great choice, see our discussion in the other answer) would be making the shared library, like Michael Aaron Safyan said.
But if you want to know how it works just because - say, you want to write your own dynamic linker, or are doing some kind of runtime code generation like a JIT compiler, or if you just wanna know - you can make a raw code file.
To use it, what we'd have to do is similar to what the linker does: load the code at the particular address it was made to work at, and run it. There is also position-independent code, which can run at any address.
Let's first get our function compiled and linked, then output it as a raw image for a certain address. Assume the function is func in the file func.c and we're using gcc on Linux. (A Windows compiler would have similar options; gcc on Windows is exactly the same, I believe, but something like Digital Mars's C compiler does it differently, with the linker switch /BINARY, for instance.)
Anyway, here's what I ran:
gcc -c func.c # makes func.o
ld func.o --oformat=binary -e func -o func.binary
This generates a file called func.binary. You can disassemble it most easily with ndisasm -b 64 func.binary (or -b 32 if you compiled the C in 32-bit mode) to confirm it looks right; I see an add instruction there, so it looks good to me.
If you loaded that, mmap()ed it with PROT_EXEC, and then called it... it should work.
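For completeness, here's roughly what that loading step could look like (a sketch that assumes func.binary really contains self-contained, position-independent code for int func(int, int), linked as shown above):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("func.binary", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* map the raw code bytes readable and executable */
    void *mem = mmap(NULL, st.st_size, PROT_READ | PROT_EXEC,
                     MAP_PRIVATE, fd, 0);
    close(fd);

    int (*func)(int, int) = (int (*)(int, int))mem;
    printf("%d\n", func(2, 3)); /* 5, if the code is truly self-contained */

    munmap(mem, st.st_size);
    return 0;
}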
Problems will be quick to come up though:
If there's more than one function in that file, they'll all be squished together.
The addresses they try to use to call each other may be totally wrong.
Global variables and other static data will be messed up.
And there's more. The operating system uses more complex file formats for executables and libraries for a reason!
To go to the next step, you could consider writing an ELF or PE loader which reads that metadata off a standard file. Of course, once you get into much of this, you'll be doing exactly what the OS provides with dlopen and LoadLibrary.... so unless the goal is to just learn about the guts, just call those functions and call it done!
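Just to give a flavor of that first step, here is a sketch of the very beginning of such a loader (a real one would go on to map the PT_LOAD segments and apply relocations):

#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Read an ELF header and print a few fields; returns 0 if it looks like ELF. */
int check_elf_header(const char *path)
{
    Elf64_Ehdr eh;
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    if (fread(&eh, sizeof eh, 1, f) != 1) { fclose(f); return -1; }
    fclose(f);

    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) return -1; /* "\x7fELF" */
    printf("type=%u machine=%u entry=%#lx\n",
           (unsigned)eh.e_type, (unsigned)eh.e_machine,
           (unsigned long)eh.e_entry);
    return 0;
}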

How to dynamically load often re-generated c code quickly?

I want to be able to generate C code dynamically and re-load it quickly into my running C program.
I am on Linux, how could this be done?
Can a library .so file on Linux be re-compiled and reloaded at runtime?
Could it be compiled without producing a .so file? Could the compiled output somehow go to memory and then be loaded from there? I want to reload the compiled code quickly.
What you want to do is reasonable, and I am doing exactly that in MELT (a high level domain specific language to extend GCC; MELT is compiled to C, thru a translator itself written in MELT).
First, when generating C code (or many other source languages), a good advice is to keep some sort of abstract syntax tree (AST) in memory. So build first the entire AST of the generated C code, then emit it as C syntax. Don't think of your code generation framework without an explicit AST (in other words, generation of C code with a bunch of printf is a maintenance nightmare, you want to have some intermediate representation).
Second, the main reason to generate C code is to take advantage of a good optimizing compiler (another reason is the portability and ubiquity of C). If you don't care about the performance of the generated code (and TCC compiles C very quickly into very naive and slow machine code), you could use other approaches, e.g. JIT libraries like GNU lightning (very quick generation of slow machine code), GNU libjit or AsmJit (the generated machine code is a bit better), or LLVM or GCCJIT (good generated machine code, but generation time comparable to a compiler's).
So if you generate C code and want it to run quickly, the compilation time of the C code is not negligible (since you would probably fork a gcc -O -fPIC -shared command to make some shared object foo.so out of your generated foo.c). In my experience, generating C code takes much less time than compiling it (with gcc -O). In MELT, the generation of C code is more than 10x faster than its compilation by GCC (and usually 30x faster). But the optimizations done by a C compiler are worth it.
Once you have emitted your C code and forked its compilation into a .so shared object, you can dlopen it. Don't be shy: my manydl.c example demonstrates that on Linux you can dlopen a huge number of shared objects (many hundreds of thousands). The real bottleneck is the compilation of the generated C code. In practice you don't really need to dlclose on Linux (unless you are coding a server program that needs to run for months); an unused shared module can stay dlopen-ed, and you are mostly leaking process address space (which is a cheap resource), since most of that unused .so will be swapped out. dlopen itself is quick; what takes time is the compilation of the C source, because you really want the optimization to be done by the C compiler.
You could use many other approaches, e.g. have a bytecode interpreter and generate that bytecode, or use Common Lisp (e.g. SBCL on Linux, which compiles dynamically to machine code), LuaJIT, Java, MetaOCaml, etc.
As others suggested, you shouldn't care much about the time it takes to write a C file; it will stay in the filesystem cache in practice (see also this). Writing it is much faster than compiling it, so keeping everything in memory is not worth the trouble. Use a tmpfs if you are concerned about I/O times.
addenda
You asked
Can a library .so file on Linux be re-compiled and re-loaded at runtime?
Of course yes: you should fork a command to build the library from the generated C code (e.g. gcc -O -fPIC -shared generated.c -o generated.so, though you could do it indirectly, e.g. by running make -j, especially if generated.so is big enough to make it worth splitting generated.c into several generated C files!), then dynamically load your library with dlopen (giving it a full path like /some/file/path/to/generated.so, and probably the RTLD_NOW flag), and use dlsym to find the relevant symbols inside. Don't think of re-loading (a second time) the same generated.so; it is better to emit a unique generated1.c (then generated2.c, etc.), compile it to a unique generated1.so (then generated2.so, etc.), and dlopen that (this can be done many hundreds of thousands of times). You may want the emitted generated*.c files to contain constructor functions that are executed at dlopen time of the generated*.so.
Your base application should define a convention about the set of dlsym-ed names (usually functions) and how they are called. It should call functions in your generated*.so only through dlsym-ed function pointers. In practice you would decide, for example, that each generated*.c defines a function void dynfoo(int) and a function int dynbar(int,int), use dlsym with "dynfoo" and "dynbar", and call them through the returned function pointers (see the sketch below). You should also define conventions for how and when these dynfoo and dynbar are called. You'd better link your base application with -rdynamic so that your generated*.c files can call your application's functions.
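A condensed sketch of the whole cycle described above (the dynfoo/dynbar names and file paths just follow the convention in the text; build the host program with -ldl and -rdynamic):

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* fork the compilation of the freshly emitted C file */
    if (system("gcc -O -fPIC -shared generated1.c -o generated1.so") != 0)
        return 1;

    void *h = dlopen("./generated1.so", RTLD_NOW);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* call only through dlsym-ed pointers, per the convention above */
    void (*dynfoo)(int)      = (void (*)(int))dlsym(h, "dynfoo");
    int  (*dynbar)(int, int) = (int (*)(int, int))dlsym(h, "dynbar");
    if (dynfoo) dynfoo(42);
    if (dynbar) printf("dynbar -> %d\n", dynbar(1, 2));

    /* no dlclose: as noted above, leaving it mapped is usually fine */
    return 0;
}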
You don't want your generated*.so to re-define existing names. For instance, you don't want to redefine malloc in your generated*.c and expect all heap allocation functions to magically use your new variant (that probably won't work, and if even if it did, it would be dangerous).
You probably won't bother to dlclose a dynamically loaded shared object, except at application clean-up and exit time (and I don't bother to dlclose at all). If you do dlclose some dynamically loaded generated*.so file, be sure that nothing in it is still used: no pointers into it, not even return addresses in call frames, may still exist.
P.S. the MELT translator is currently 57KLOC of MELT code translated to nearly 1770KLOC of C code.
Your best bet is probably the TCC compiler, which allows you to do exactly this: compile source code, add it to your program, and run it, all without touching files.
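For instance, with TCC's libtcc API the whole cycle can stay in memory (a sketch; note that tcc_relocate's signature has varied between TCC releases, so check your libtcc.h):

#include <libtcc.h>
#include <stdio.h>

int main(void)
{
    TCCState *s = tcc_new();
    tcc_set_output_type(s, TCC_OUTPUT_MEMORY); /* compile to memory, no files */

    tcc_compile_string(s, "int add(int a, int b) { return a + b; }");
    tcc_relocate(s, TCC_RELOCATE_AUTO); /* lay the code out in memory */

    int (*add)(int, int) = (int (*)(int, int))tcc_get_symbol(s, "add");
    printf("%d\n", add(2, 3)); /* prints 5 */

    tcc_delete(s);
    return 0;
}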
For a more robust but non-C-based solution, you should probably check out the LLVM project, which does much the same thing but from the perspective of producing JITs. You don't go via C; instead you use a kind of abstract portable machine code, but the generated code runs loads faster and the project is under more active development.
OTOH if you want to do it all manually by shelling out to gcc, compiling a .so and then loading it yourself, dlopen() and dlclose() will do what you want.
Are you sure C is the right answer here? There are various interpreted languages such as Lua, Bigloo Scheme, or perhaps even Python that embed very well into an existing C application. You can write the dynamic parts using the extension language, which will support reloading code at runtime.
The obvious disadvantage is performance - if you absolutely need the raw speed of compiled C then these may be a no-go.
If you want to reload a library dynamically, you can use the dlopen function (see its man page). It opens a .so library file and returns a void* handle to it; then you can get a pointer to any function/variable of your library with dlsym.
To compile your libraries in-memory, the best thing I think you can do is create a memory filesystem, as described here.
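On a modern Linux, a hedged alternative to a tmpfs mount is memfd_create(2): let the compiler write into an anonymous memory-backed file and dlopen it through /proc/self/fd (this relies on the fd being inherited by the compiler child process, which it is here since we don't pass MFD_CLOEXEC; gen.c is a placeholder name):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    int fd = memfd_create("gen.so", 0); /* anonymous in-memory file */
    if (fd < 0) return 1;

    char cmd[128];
    snprintf(cmd, sizeof cmd,
             "gcc -O -fPIC -shared gen.c -o /proc/self/fd/%d", fd);
    if (system(cmd) != 0) return 1; /* the compiler writes into the memfd */

    char path[64];
    snprintf(path, sizeof path, "/proc/self/fd/%d", fd);
    void *h = dlopen(path, RTLD_NOW); /* load it straight from memory */
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    /* ... dlsym and call as usual ... */
    return 0;
}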

Advantages of dynamic linking with either ld utility vs. dlfcn API?

I am doing some research into platform-independent code and found mention of the dlfcn API. It was the first time I had come across it, so I did further research. Hopefully my lack of experience with platform-independent code, compiling, and linking won't show too much in this post, but as far as I can tell the dlfcn API just lets us do programmatically the same dynamic linking that the ld utility does. If I have misconceptions, please correct me. Based on what I think I know about the ld utility and the dlfcn API, I have some questions.
What are the advantages of using either the ld utility vs. dlfcn API to dynamically link?
My first thought was that the dlfcn API seems like a waste of my time, since I need to request pointers to the functions, versus having ld examine a symbol table for undefined symbols and then link them. Similarly, ld does everything for me, while I have to do everything by hand with the dlfcn API (i.e. open/load the library, get a function pointer, close the library, etc.). But on second glance I thought there might be some advantages. One is that we can unload a library from memory once we are done using it.
In this way memory could be saved if we know we won't need a library the whole time. I am unsure whether there is any such "memory/library" management for libraries dynamically linked by ld. Similarly, I am unsure in which scenarios/environments we would be interested in using the dlfcn API to save said memory, as this hardly seems a problem on modern systems. I presume one would be the use of the library on a system with very limited resources (maybe some embedded system?).
What other advantages or disadvantages may there be?
What "coding pattern" is used for platform independent code in regards to dynamic linking?
If I were writing platform-independent code that depends on system calls, I could see myself achieving it in one of three styles:
Logical branching directly in my libraries code via macros. Something like:
void myAwesomeFunction()
{
    ...
#if defined(_MSC_VER)
    // Call some Windows system call
#elif defined(__GNUC__)
    // Call some Unix system call
#endif
    ...
}
Create generic system call functions and use those in my libraries code. Something like:
OS_Calls.h
void OS_openFile(string myFile)
{
    ...
#if defined(_MSC_VER)
    // Call Windows system call to open file
#elif defined(__GNUC__)
    // Call Unix system call to open file
#endif
    ...
}
MyAwesomeFunctions.cpp
#include "OS_Calls.h"
void myAwesomeFunction()
{
    ...
    OS_openFile("my awesome file");
    ...
}
Similar to option one, but adding a layer of abstraction by using the dlfcn API:
MyLibraryLoader.h
void* GetLibraryFunction(void* lib, char* funcName)
{
    ...
    return dlsym(lib, funcName);
}
MyAwesomeFunctions.cpp
#include "MyLibraryLoader.h"
void myAwesomeFunction()
{
    // Note: the void* returned by dlsym must be cast to the proper
    // function pointer type before it can actually be called.
    Result result = GetLibraryFunction(someLib, someFunc)(arguments...);
}
Which ones are typically used and why? And if there are others that aren't listed here and preferable to mine, please let me know.
Thanks for reading this post. I will keep it updated so that it may serve as a future informative reference.
dlfcn and ld do not solve the same problem: in fact, you can use both in your project.
The dlfcn API is meant to support plugin architectures, in which you define an interface that modules should implement. An application can then load different implementations of that interface, for various reasons (extensibility, customization, etc.).
ld, well, links the libraries your application requests, but it does that at link time, not at run time. It doesn't support plugin architectures in any way, since ld links the objects specified on the command line.
Of course you could use only the dlfcn API, but it is not meant to be used that way and, of course, using it that way would be a huge pain in your rectum.
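For reference, the plugin pattern looks roughly like this (plugin.so and plugin_entry are hypothetical names; link with -ldl):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *h = dlopen("./plugin.so", RTLD_LAZY);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* look up the interface function every plugin must implement */
    void (*entry)(void) = (void (*)(void))dlsym(h, "plugin_entry");
    if (entry) entry();

    dlclose(h); /* plugins can also be unloaded when no longer needed */
    return 0;
}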
For your second question, I think the best pattern is the second one.
Branching "directly in the code" can be confusing, because it's not immediately obvious what the two branches accomplish, something which is well-defined if you define a proper abstraction and you implement it using multiple branches for each supported architecture.
Using the dlfcn API here is pretty pointless, because you don't gain a uniform interface to call (which is exactly the argument in favor of the second pattern); it just adds bloat to your code.
HTH
I don't think dynamic linking helps you much with platform independence.
Your second option seems like a reasonable way to achieve platform independence. Most of the code just calls your platform-independent wrappers, while a small part of it is "dirty" with ifdefs.
I don't see how dynamic loading helps here.
Some pros and cons for dynamic loading:
1. Cons:
a. Not the "straightforward" way; it requires more work.
b. Prevents standard tools (e.g. ldd) from analyzing dependencies (and thus from helping you understand what you need in order to run successfully).
2. Pros:
a. Allows loading only what you need (e.g. depending on command-line arguments), or unloading what you don't. This can save memory.
b. Lets you generate library names in more complicated ways (e.g. read a list of plugins from a configuration file).

Dynamic relocation of code section

Just out of curiosity, I wonder if it is possible to relocate a piece of code during the execution of a program. For instance, I have a function, and this function should be moved in memory each time after it has been executed. One idea that came to our mind is to use self-modifying code to do that. According to some online resources, self-modifying code can be executed on Linux, but I am still not sure whether such a dynamic relocation is possible. Does anyone have experience with that?
Yes, dynamic relocation is definitely possible. However, you have to make sure that the code is completely self-contained, or that it accesses globals/external functions through absolute references. If your code can be completely position-independent, meaning the only references it makes are relative to itself, you're set. Otherwise you will need to do the fixups yourself at load time.
With GCC, you can use -fpic to generate position independent code. Passing -q or --emit-relocs to the linker will make it emit relocation information. The ELF specification (PDF link) has information about how to use that relocation information; if you're not using ELF, you'll have to find the appropriate documentation for your format.
As Carl says, it can be done, but you're opening a can of worms. In practice, the only people who take the trouble to do this are academics or malware authors (now donning my flame proof cloak).
You can copy some code into a malloc'd heap region and then call it via function pointers, but depending on the OS you may have to enable execution in that region. You could try to copy some code into the code segment (taking care not to overwrite the following function), but the OS has likely made that segment read-only. You might want to look at the Linux kernel and see how it loads its modules. A sketch of the heap-copy approach follows.
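A heavily hedged illustration of that heap-copy idea (it uses mmap rather than malloc, since malloc'd memory is usually non-executable; it also assumes add is fully position-independent and that the marker function is laid out directly after it, neither of which the C standard guarantees, and some hardened kernels refuse writable-plus-executable mappings outright):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static int add(int a, int b) { return a + b; }
static void add_end(void) {} /* assumed to be placed directly after add */

int main(void)
{
    size_t len = (size_t)((char *)add_end - (char *)add);

    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(mem, (void *)add, len);
    __builtin___clear_cache((char *)mem, (char *)mem + len); /* flush icache (needed on ARM) */

    int (*moved)(int, int) = (int (*)(int, int))mem;
    printf("%d\n", moved(2, 3)); /* 5, if all the assumptions hold */

    munmap(mem, len);
    return 0;
}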
If all these different functions exist at compile time, then you could simply use a function pointer to keep track of the next one to be called. If you absolutely have to modify the function at runtime, and that modification can't be done in place, then you could also use a function pointer that is updated with the address of the new function when it is created/loaded. The rest of your system would then call the self-modifying function through the function pointer; it doesn't have to know or care about the self-modifying code, and you only have to do the fixup in one place.
