Given a library of C functions, is there a way to automate validation if exported functions are reentrant?
Either at runtime (after instrumentation if needed) or from code-analysis. Source code is available.
Note: This is not a std-C lib, nor a well-documented GNU lib with thread-safety contract.
A function is not reentrant if it cannot be interrupted in the middle of its execution and then safely called again ("re-entered") before its previous invocations complete execution. The hidden state that breaks reentrancy often makes a function unsafe to call from several threads as well.
In the C standard library, some functions fall into this category. You do not need a tool such as Valgrind to check for thread safety; instead, read the documentation (or man page) for the particular function you are concerned about.
Usually, but not always, the C library offers a thread-safe counterpart if a function is not thread safe.
For example, the string tokenizer function strtok has a re-entrant version, strtok_r:
char *strtok(char *str, const char *delim);
char *strtok_r(char *str, const char *delim, char **saveptr);
with the difference being, your code (thread) maintains a pointer to the last tokenized string (the work in progress) instead of the function maintaining it. This allows multiple threads to call strtok_r in parallel.
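As a small illustration of why the caller-owned saveptr matters, the sketch below nests two tokenizations. The function name count_fields and the "key=value;key=value" input format are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <string.h>

/* Counts well-formed "key=value" fields in a string such as "a=1;b=2;c=3".
 * The outer loop splits on ';' and the inner calls split each field on '='.
 * Each level keeps its own saveptr, so the two tokenizations do not disturb
 * each other. With plain strtok, the inner call would overwrite the hidden
 * state of the outer one. The input buffer is modified in place. */
int count_fields(char *input)
{
    char *outer_save = NULL;
    int fields = 0;

    for (char *field = strtok_r(input, ";", &outer_save);
         field != NULL;
         field = strtok_r(NULL, ";", &outer_save)) {
        char *inner_save = NULL;
        char *key = strtok_r(field, "=", &inner_save);
        char *value = strtok_r(NULL, "=", &inner_save);
        if (key != NULL && value != NULL)
            fields++;
    }
    return fields;
}
```

The same property is what lets two threads tokenize different strings at the same time: each thread simply owns its own saveptr.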
In addition, here is another link on SO discussing Threadsafe vs re-entrant behavior.
--
EDIT: More directly related to the original question. I do not believe such a tool exists that can tell you if a function is re-entrant. Tools such as ltrace may help with this. My comments above were illustrating that documentation for the library should exist and I used the C standard library as an example.
Regarding Valgrind, there is a tool for it called Helgrind that can test for synchronization errors (see: http://valgrind.org/docs/manual/hg-manual.html, section 7.1)
We have a software project with real-time constraints largely written in C++ but making use of a number of C libraries, running in a POSIX operating system. To satisfy real-time constraints, we have moved almost all of our text logging off of stderr pipe and into shared memory ring buffers.
The problem we have now is that when old code, or a C library, calls assert, the message ends up in stderr and not in our ring buffers with the rest of the logs. We'd like to find a way to redirect the output of assert.
There are three basic approaches here that I have considered:
1.) Make our own assert macro -- basically, don't use #include <cassert>, give our own definition for assert. This would work but it would be prohibitively difficult to patch all of the libraries that we are using that call assert to include a different header.
2.) Patch libc -- modify the libc implementation of __assert_fail. This would work, but it would be really awkward in practice because this would mean that we can't build libc without building our logging infra. We could make it so that at run-time, we can pass a function pointer to libc that is the "assert handler" -- that's something that we could consider. The question is if there is a simpler / less intrusive solution than this.
3.) Patch libc header so that __assert_fail is marked with __attribute__((weak)). This means that we can override it at link-time with a custom implementation, but if our custom implementation isn't linked in, then we link to the regular libc implementation. Actually I was hoping that this function already would be marked with __attribute__((weak)) and I was surprised to find that it isn't apparently.
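The run-time "assert handler" idea mentioned under option 2 could look roughly like the sketch below. None of these names exist in libc: set_assert_handler, report_assert_failure, buffer_handler and last_logged are hypothetical, and the fixed char array is only a stand-in for a shared memory ring buffer.

```c
#include <stdio.h>

typedef void (*assert_handler_fn)(const char *expr, const char *file,
                                  unsigned int line, const char *func);

/* Default behavior: print the failure to stderr. */
static void stderr_handler(const char *expr, const char *file,
                           unsigned int line, const char *func)
{
    fprintf(stderr, "%s:%u: %s: assertion failed: %s\n", file, line, func, expr);
}

static assert_handler_fn current_handler = stderr_handler;

/* Stand-in for a ring buffer slot (hypothetical). */
static char log_slot[256];

/* Alternative handler: format the failure into the buffer instead of stderr. */
void buffer_handler(const char *expr, const char *file,
                    unsigned int line, const char *func)
{
    snprintf(log_slot, sizeof log_slot, "%s:%u: %s: assertion failed: %s",
             file, line, func, expr);
}

const char *last_logged(void)
{
    return log_slot;
}

/* Installs a handler; passing NULL restores the stderr default. */
void set_assert_handler(assert_handler_fn handler)
{
    current_handler = handler ? handler : stderr_handler;
}

/* What a patched failure hook might do: forward to the installed handler.
 * A real replacement would call abort() afterwards; that is left out here
 * so the sketch can be exercised without terminating the process. */
void report_assert_failure(const char *expr, const char *file,
                           unsigned int line, const char *func)
{
    current_handler(expr, file, line, func);
}
```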
My main question is: What are the possible downsides of option (3) -- patching libc so that this line: https://github.com/lattera/glibc/blob/master/assert/assert.h#L67
extern void __assert_fail (const char *__assertion, const char *__file,
unsigned int __line, const char *__function)
__THROW __attribute__ ((__noreturn__));
is marked with __attribute__((weak)) as well ?
Is there a good reason I didn't think of that the maintainers didn't already do this?
How could any existing program that is currently linking and running successfully against libc break after I patch the header in this way? It can't happen, right?
Is there a significant run-time cost to using weak-linking symbols here for some reason? libc is already a shared library for us, and I would think the cost of dynamic linking should swamp any case analysis regarding weak vs. strong resolution that the system has to do at load time?
Is there a simpler / more elegant approach here that I didn't think of?
Some functions in glibc, particularly, strtod and malloc, are marked with a special gcc attribute __attribute__((weak)). This is a linker directive -- it tells gcc that these symbols should be marked as "weak symbols", which means that if two versions of the symbol are found at link time, the "strong" one is chosen over the weak one.
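A minimal sketch of a weak default definition, assuming GCC or Clang. The names log_backend and active_backend are invented for illustration and are not part of glibc.

```c
/* Weak default: the linker uses this definition only if no other object
 * file provides a strong definition of log_backend. */
__attribute__((weak)) const char *log_backend(void)
{
    return "stderr";
}

/* Callers do not need to know which definition was chosen at link time. */
const char *active_backend(void)
{
    return log_backend();
}
```

If a second object file on the link line defined log_backend without the attribute (for example, returning "ring-buffer"), that strong definition would be the one active_backend ends up calling.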
The motivation for this is described on wikipedia:
Use cases
Weak symbols can be used as a mechanism to provide default implementations of functions that can be replaced by more specialized (e.g. optimized) ones at link-time. The default implementation is then declared as weak, and, on certain targets, object files with strongly declared symbols are added to the linker command line.
If a library defines a symbol as weak, a program that links that library is free to provide a strong one for, say, customization purposes.
Another use case for weak symbols is the maintenance of binary backward compatibility.
However, in both glibc and musl libc, it appears to me that the __assert_fail function (to which the assert.h macro forwards) is not marked as a weak symbol.
https://github.com/lattera/glibc/blob/master/assert/assert.h
https://github.com/lattera/glibc/blob/master/assert/assert.c
https://github.com/cloudius-systems/musl/blob/master/include/assert.h
You don't need __attribute__((weak)) on symbol __assert_fail from glibc. Just write your own implementation of __assert_fail in your program, and the linker should use your implementation, for example:
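#include <stdio.h>
#include <stdlib.h>   /* abort() */
#include <assert.h>
/* Defined in the program, so assert() failures resolve here rather than to libc's version. */
void __assert_fail(const char * assertion, const char * file, unsigned int line, const char * function)
{
    fprintf(stderr, "My custom message\n");
    abort();
}
int main(void)
{
    assert(0);
    printf("Hello World");
    return 0;
}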
That's because when resolving symbols by the linker the __assert_fail symbol will already be defined by your program, so the linker shouldn't pick the symbol defined by libc.
If you really need __assert_fail to be defined as a weak symbol inside libc, why not just run objcopy --weaken-symbol=__assert_fail /lib/libc.so /lib/libc_with_weak_assert_fail.so? I don't think you need to rebuild libc from sources for that.
If I were you, I would probably opt for opening a pipe(2) and dup2(2)'ing the write end of that pipe onto stderr's file descriptor. I'd service the read end of the pipe as part of the main poll(2) loop (or whatever the equivalent is in your system) and write the contents to the ring buffer.
This is obviously slower to handle actual output, but from your write-up, such output is rare, so the impact ought to be negligible (especially if you already have a poll or select this fd can piggyback on).
It seems to me that tweaking libc or relying on side-effects of the tools might break in the future and will be a pain to debug. I'd go for the guaranteed-safe mechanism and pay the performance price if at all possible.
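A minimal sketch of the pipe approach, using dup2(2) to point file descriptor 2 at the write end of a pipe. The function names capture_stderr and drain_stderr are hypothetical, and a real program would call the drain step from its poll loop and copy the bytes into the ring buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Redirects file descriptor 2 to the write end of a new pipe.
 * Returns the non-blocking read end, or -1 on failure. */
int capture_stderr(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;
    if (dup2(fds[1], STDERR_FILENO) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (fds[1] != STDERR_FILENO)
        close(fds[1]);              /* fd 2 now refers to the write end */

    int flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fds[0]);
        return -1;
    }
    return fds[0];
}

/* Reads whatever is currently pending on the pipe into buf.
 * Returns the number of bytes read, or 0 if nothing is pending or on error. */
ssize_t drain_stderr(int read_fd, char *buf, size_t cap)
{
    ssize_t n = read(read_fd, buf, cap);
    return n < 0 ? 0 : n;
}
```

Because the redirection happens at the file descriptor level, it also catches output from C libraries that write to stderr directly, which is what makes it independent of how assert is implemented.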
Where can I get a definitive answer, whether my memcpy (using the eglibc implementation that comes with Ubuntu) is thread safe? - Honestly, I really did not find a clear YES or NO in the docs.
By the way, with "thread safe" I mean it is safe to use memcpy concurrently whenever it would be safe to copy the data byte for byte concurrently. This should be possible at least if read-only data are copied to regions that do not overlap.
Ideally I would like to see something like the lists at the bottom of this page in the ARM compiler docs.
You can find that list here, at chapter 2.9.1 Thread-Safety : http://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_01
That is, this is a list of functions that POSIX does not require to be thread safe. All other functions are required to be thread safe. POSIX includes the standard C library and the typical "unix" interfaces. (Full list here, http://pubs.opengroup.org/onlinepubs/9699919799/functions/contents.html)
memcpy() is specified by POSIX but is not on the list in 2.9.1, and can therefore be considered thread safe.
The various environments on Linux at least try to implement POSIX to the best of their abilities. Functions on Linux/glibc might be thread safe even where POSIX doesn't require it, though this is rarely documented. For functions and libraries beyond what POSIX covers, you are left with what their authors have documented...
From what I can tell, POSIX equates thread safety with reentrancy and guarantees there are no internal data races. You, however, are responsible for possible external data races, such as calling memcpy() on memory that might be updated concurrently.
It depends on the function, and how you use it.
Take memcpy, for example: it is generally thread safe if you copy data where both source and destination are private to a single thread. If you write to data that can be read or written by another thread, it is no longer thread safe and you have to protect the access.
If a glibc function is not thread-safe then the man page will say so, and there will (most likely) be a thread safe variant also documented.
See, for example, man strtok:
SYNOPSIS
#include <string.h>
char *strtok(char *str, const char *delim);
char *strtok_r(char *str, const char *delim, char **saveptr);
The _r (for "reentrant") is the thread-safe variant.
Unfortunately, the man pages do not make a habit of stating that a function is thread safe, but only mention thread-safety when it is an issue.
As with all functions, if you give it a pointer to a shared resource then it will become thread-unsafe. It is up to you to handle locking.
I am now porting an single-threaded library to support multi-threads, and I need the whole list of functions that use local static or global variables.
Any information is appreciated.
Check the manual page for each function you use ... the non-thread-safe ones will be identified as such, and the manual page will mention a thread safe version when there is one (e.g., readdir_r). You could extract the list by running a script over the man pages.
Edit: Although my answer has been accepted, I fear that it is inaccurate and possibly dangerous. For example, while strerror_r mentions that it is a thread safe version of strerror, strerror itself says nothing about thread safety ... what it says instead is "the string might be overwritten", which merely implies that it isn't thread-safe. So you need to search for at least "might be overwritten" as well as "thread", but there's no guarantee that even that will be complete.
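The "might be overwritten" wording usually points at a static buffer inside the function. The sketch below shows that pattern next to an _r-style variant; describe_code and describe_code_r are made-up names, not real libc functions.

```c
#include <stdio.h>

/* Not reentrant: every call formats into the same static buffer, so a
 * second call overwrites the string returned by the first. */
const char *describe_code(int code)
{
    static char buf[32];
    snprintf(buf, sizeof buf, "error %d", code);
    return buf;
}

/* Reentrant variant in the style of the _r functions: the caller owns
 * the buffer, so concurrent or nested calls cannot clobber each other. */
char *describe_code_r(int code, char *buf, size_t len)
{
    snprintf(buf, len, "error %d", code);
    return buf;
}
```

Calling describe_code(1) and then describe_code(2) yields two pointers to the same storage, both of which now read "error 2". With describe_code_r, each caller keeps its own result.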
It's always a good idea to know if a particular function is reentrant or not, but you must also consider the situation where you call several reentrant functions from a shared piece of code from multiple threads, which could also lead to problems when using shared data.
So, if you have any data shared between threads, the data must be "protected" regardless of the fact that the functions being called are reentrant.
Consider the following function:
void yourFunc(CommonObject *o)
{
/* This function is NOT thread safe */
reentrant_func1(o->propertyA);
reentrant_func2(o->propertyA);
}
If this function is not mutex protected, you will get undesired behavior in a multithreaded application, regardless of the fact that func1 and func2 are reentrant.
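One way the protected version might look, as a sketch only. The layout of CommonObject, the embedded mutex, and the bodies of reentrant_func1 and reentrant_func2 are assumptions made for the example, and the helpers here update propertyA so that the need for the lock is visible.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;   /* guards propertyA (assumed layout) */
    int propertyA;
} CommonObject;

/* Hypothetical reentrant helpers: they touch nothing but their argument. */
static int reentrant_func1(int value) { return value + 1; }
static int reentrant_func2(int value) { return value * 2; }

void yourFunc(CommonObject *o)
{
    /* The two calls form one read-modify-write sequence on shared data.
     * Holding the lock across both keeps another thread from changing
     * propertyA between them. */
    pthread_mutex_lock(&o->lock);
    o->propertyA = reentrant_func1(o->propertyA);
    o->propertyA = reentrant_func2(o->propertyA);
    pthread_mutex_unlock(&o->lock);
}
```

An object would be set up with something like `CommonObject o = { PTHREAD_MUTEX_INITIALIZER, 3 };` before being shared between threads.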
I want to modify an existing shared library so that it uses different memory management routines depending on the application using the shared library.
(For now) there will be two families of memory management routines:
The standard malloc, calloc etc functions
specialized versions of malloc, calloc etc
I have come up with a potential way of solving this problem (with the help of some people here on SO). There are still a few grey areas and I would like some feedback on my proposal so far.
This is how I intend to implement the modification:
1.) Replace existing calls to malloc/calloc etc with my_malloc/my_calloc etc. These new functions will invoke correctly assigned function pointers instead of calling hard coded function names.
2.) Provide a mechanism for the shared library to initialize the function pointers used by my_malloc etc to point to the standard C memory mgmt routines - this allows me to provide backward compatibility to applications which depend on this shared library - so they don't have to be modified as well. In C++, I could have done this by using static variable initialization (for example) - I'm not sure if the same 'pattern' can be used in C.
3.) Introduce a new idempotent function initAPI(type), which is called (at startup) by applications that need to use different mem mgmt routines in the shared library. The initAPI() function assigns the memory mgmt func ptrs to the appropriate functions.
Clearly, it would be preferable if I could restrict who could call initAPI() or when it was called - for example, the function should NOT be called after API calls have been made to the library - as this will change the memory mgmt routines. So I would like to restrict where it is called and by whom. This is an access problem which can be solved by making the method private in C++, I am not sure how to do this in C.
The problems in 2 and 3 above can be trivially resolved in C++, however I am constrained to using C, so I would like to solve these issues in C.
Finally, assuming that the function pointers can be correctly set during initialisation as described above - I have a second question, regarding the visibility of global variables in a shared library, across different processes using the shared library. The function pointers will be implemented as global variables (I'm not too concerned about thread safety FOR NOW - although I envisage wrapping access with mutex locking at some point)* and each application using the shared library should not interfere with the memory management routines used for another application using the shared library.
I suspect that it is code (not data) that is shared between processes using a shlib - however, I would like that confirmed - preferably, with a link that backs up that assertion.
*Note: if I am naively downplaying threading issues that may occur in the future as a result of the 'architecture' I described above, someone please alert me!..
BTW, I am building the library on Linux (Ubuntu)
Since I'm not entirely sure what the question being asked is, I will try to provide information that may be of use.
You've indicated C and Linux, so it is probably safe to assume you are also using the GNU toolchain.
GCC provides a constructor function attribute that causes a function to be called automatically before execution enters main(). You could use this to better control when your library initialization routine, initAPI() is called.
void __attribute__ ((constructor)) initAPI(void);
In the case of library initialization, constructor routines are executed before dlopen() returns if the library is loaded at runtime or before main() is started if the library is loaded at load time.
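A small sketch of the constructor attribute in use, assuming GCC or Clang. The api_alloc and api_is_ready helpers are hypothetical and only show that the initialization has already happened by the time ordinary code runs.

```c
#include <stdlib.h>

static void *(*alloc_fn)(size_t) = NULL;
static int api_ready = 0;

/* Runs automatically before main() (or before dlopen() returns when the
 * library is loaded at run time) and installs the default allocator. */
__attribute__((constructor))
static void initAPI(void)
{
    alloc_fn = malloc;
    api_ready = 1;
}

/* Reports whether the constructor has run. */
int api_is_ready(void)
{
    return api_ready;
}

/* Allocates through whichever function initAPI installed. */
void *api_alloc(size_t size)
{
    return alloc_fn(size);
}
```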
The GNU linker has a --wrap <symbol> option which allows you to provide wrappers for system functions.
If you link with --wrap malloc, references to malloc() will redirect to __wrap_malloc() (which you implement), and references to __real_malloc() will redirect to the original malloc() (so you can call it from within your wrapper implementation).
Instead of using the --wrap malloc option to provide a reference to the original malloc() you could also dynamically load a pointer to the original malloc() using dlsym(). You cannot directly call the original malloc() from the wrapper because it will be interpreted as a recursive call to the wrapper itself.
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <dlfcn.h>
void * malloc(size_t size) {
    static void * (*func)(size_t) = NULL;
    void * ret;
    if (!func) {
        /* get reference to original (libc provided) malloc */
        func = (void *(*)(size_t)) dlsym(RTLD_NEXT, "malloc");
    }
    /* code to execute before calling malloc */
    /* ... */
    /* call original malloc */
    ret = func(size);
    /* code to execute after calling malloc */
    /* ... */
    return ret;
}
I suggest reading Jay Conrod's blog post entitled Tutorial: Function Interposition in Linux for additional information on replacing calls to functions in dynamic libraries with calls to your own wrapper functions.
-1 for the lack of concrete questions. The text is long, could have been written more succinctly, and it does not contain a single question mark.
Now to address your problems:
Static data (what you call "global variables") of a shared library is per-process. Your global variables in one process will not interfere with global variables in another process. No need for mutexes.
In C, you cannot restrict[1] who can call a function. It can be called by anybody who knows its name or has a pointer to it. You can code initAPI() such that it visibly aborts the program (crashes it) if it is not the first library function called. You are the library writer, you set the rules of the game, and you have NO obligation towards coders who do not respect the rules.
[1] You can declare the function with static, meaning it can be called by name only by the code within the same translation unit; it can still be called through a pointer by anybody who manages to obtain a pointer to it. Such functions are not "exported" from libraries, so this is not applicable to your scenario.
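Both points can be sketched in plain C: function addresses are constant expressions, so the default pointers can be set by static initialization with no startup call, and initAPI() can refuse to run once the library has been used. The mem_hooks type and the my_malloc/my_free wrappers are hypothetical, and this variant returns an error where the text above suggests aborting.

```c
#include <stdlib.h>

typedef struct {
    void *(*alloc)(size_t);
    void  (*release)(void *);
} mem_hooks;

/* Static initialization: applications that never call initAPI() get the
 * standard routines. Each process has its own copy of these variables. */
static mem_hooks hooks = { malloc, free };
static int locked = 0;   /* set once the hooks are installed or used */

/* Installs custom routines. Fails with -1 if the library has already been
 * used or the hooks are incomplete; a stricter library could abort() here.
 * Not thread safe: intended to be called once from the main thread. */
int initAPI(const mem_hooks *custom)
{
    if (locked || custom == NULL ||
        custom->alloc == NULL || custom->release == NULL)
        return -1;
    hooks = *custom;
    locked = 1;
    return 0;
}

void *my_malloc(size_t size)
{
    locked = 1;          /* after the first allocation, hooks stay fixed */
    return hooks.alloc(size);
}

void my_free(void *ptr)
{
    locked = 1;
    hooks.release(ptr);
}
```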
Achieving this:
(For now) there will be two families of memory management routines:
The standard malloc, calloc etc functions
specialized versions of malloc, calloc etc
with dynamic libraries on Linux is trivial, and does not require the complicated scheme you have concocted (nor the LD_PRELOAD or dlopen suggested by @ugoren).
When you want to provide specialized versions of malloc and friends, simply link these routines into your main executable. Voila: your existing shared library will pick them up from there, no modifications required.
You could also build specialized malloc into e.g. libmymalloc.so, and put that library on the link line before libc, to achieve the same result.
The dynamic loader will use the first malloc it can see, and searches the list starting from the a.out, and proceeding to search other libraries in the same order they were listed on link command line.
UPDATE:
On further reflection, I don't think what you propose will work.
Yes, it will work (I use that functionality every day, by linking tcmalloc into my main executable).
When your shared library (the one providing an API) calls malloc "behind the scenes", which (of possibly several) malloc implementations does it get? The first one that is visible to the dynamic linker. If you link a malloc implementation into a.out, that will be the one.
It's easy enough for you to require that your initialization function is:
called from the main thread
that the client may call it exactly once
and that the client may provide the optional function pointers by parameter
If different applications run in separate processes, it's quite simple to do using dynamic libraries.
The library can simply call malloc() and free(), and applications that want to override them could load another library with alternative implementations of these functions.
This can be done with the LD_PRELOAD environment variable.
Or, if your library is loaded with dlopen(), just load the malloc library first.
This is basically what tools such as valgrind, which replace malloc, do.
Two separate questions here really: Can I use regexes in a multithreaded program without locking and, if so, can I use the same regex_t at the same time in multiple threads? I can't find an answer on Google or the manpages.
http://www.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html
2.9.1 Thread-Safety
All functions defined by this volume of POSIX.1-2008 shall be thread-safe, except that the following functions need not be thread-safe.
...
regexec and regcomp are not in that list, so they are required to be thread-safe.
See also: http://www.opengroup.org/onlinepubs/9699919799/functions/regcomp.html
Part of the rationale text reads:
The interface is defined so that the matched substrings rm_sp and rm_ep are in a separate regmatch_t structure instead of in regex_t. This allows a single compiled RE to be used simultaneously in several contexts; in main() and a signal handler, perhaps, or in multiple threads of lightweight processes.
Can I use regexes in a multithreaded program without locking
Different ones, yes.
can I use the same regex_t at the same time in multiple threads?
In general: If you plan on doing so, you will have to do the locking around the functions, since few data structures do the locking for you.
regexec: Since regexec takes a const regex_t *, however, calling it concurrently on the same compiled regex without locking seems safe. (After all, this is POSIX.1-2001, where stupid stuff like the static buffers used in the early BSD APIs usually doesn't occur anymore.)
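The interface shape described above can be sketched as follows: the compiled regex_t is only read, and each caller passes its own regmatch_t for the result. The helper name find_match is made up, and the sketch illustrates the calling pattern rather than proving thread safety.

```c
#include <regex.h>

/* Runs an already compiled pattern against text. The regex_t is passed as
 * const and the match offsets go into the caller's own regmatch_t, so
 * callers do not share any writable state through this function.
 * Returns 1 and fills *match on success, 0 if there is no match. */
int find_match(const regex_t *re, const char *text, regmatch_t *match)
{
    return regexec(re, text, 1, match, 0) == 0;
}
```

A program would call regcomp() once at startup, hand the same regex_t to every thread, and give each thread a local regmatch_t to pass in.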