Safety nets in complex multi-threaded code? - c

As a developer who has just finished writing thousands of lines of complex multi-threaded 'C' code in a project, and which is going to be enhanced, modified etc. by several other developers unfamiliar with this code in the future, I wanted to find out what kind of safety nets do you guys try to put in such code? As an example I could do these:
Define accessor macros for lock protected
structure members, which assert that
the corresponding lock is held. This
makes it clear that these members
are lock-protected to anyone unfamiliar with this code.
Functions which are supposed to be
called with some spinlock held,
assert that the spinlock is being held.
What kind of safety nets have you put into multi-threaded code that you have written?
What kind of problems have you faced when other developers modified such code?
What kind of debugging aids have you put into such code?
Thanks for your comments.

There are a number of things we do in our product (a hypervisor designed to help you find concurrency bugs in applications) that are more generally useful. Note that we do these in our code itself (because its a highly concurrent piece of software) and that some of these are useful whether or not you are writing concurrent code.
Like you, we have the ability to assert(lock_held(...)) and use it.
We also (because we have our own scheduler) can assert(single_threaded()) for those (rare) situations where we count on no other thread being active in the system.
Memory corruption from one thread to another is pretty common (and hard to debug) so we do two things to address this: sprinkled throughout our thread stack are some magic cookies. We periodically (in our get_thread_id()) function invoke a "validate_thread_stack()" function that checks these cookies to make sure the stack is not corrupted.
Our malloc sticks magic cookies before and after a malloc block of memory and checks these on free. If anyone overruns their data these can be used to find the corruption early.
On free() we blast a well known pattern (in our case 0xdddd...) over the memory. This nicely corrupts anyone else who had a dangling pointer left over to that memory region.
We have a guard page (a memory page not mapped into the address space) near the bottom of the thread stack. If the thread overruns its stack, we catch it via page fault and drop into our debugger.
Our locks are witnessed. Checkout the FreeBSD lock witness code. Its like that but homebrew. Basically the witness code is a lightweight way of detecting potential deadlocks by looking at cycles in the lock acquisition graph.
Our locks are also wrapped with accessors that record the file/line number of acquisition and release. For double unlocks or double locks, you get pretty debug information on your screwup.
Our locks are also profiled. Once you get your code working you want it working well. We track the usual things like how many acquisitions, how long it took to acquire it.
In our system, we have an expectation that locks are not contended (we carefully designed the code this way). So if you wait for a spin lock longer than a second or two in our system you get dropped into the debugger because its most likely not a good thing.
Our variables that are meant to be updated atomically are wrapped inside of C struct's. The reason for this is to prevent sloppy code where you mix good use: atomic_increment(&var); and bad use var++. We make it very hard to write the latter code.
"volatile" is forbidden in our code base because its ambiguously implemented by compilers. Its a bad way to try and cobble together synchronization.
And of course code reviews. If you can't explain your concurrency assumptions and locking discipline to a colleague, then there's definitely issues with the code :-)

Make everything absolutely obvious, so that other developers cannot miss the synchronization scope when they view subsections of the code in isolation.
for example: don't hold a lock in code that spans multiple files.

Seems like you've answered your own question: put lots of assertions into the code. They will tell other developers what invariants and preconditions must hold.

Related

Calling convention which only allows one instance of a function at a time

Say I have multiple threads and all threads call the same function at approximately the same time.
Is there a calling convention which would only allow one instance of the function at any time? What I mean is that the function called by the second thread would only start after the function called by the first thread had returned.
Or are these calling conventions compiler specific? I don't have a whole lot of experience using them.
(Skip to the bottom if you don't care about the threading mumbo-jumbo)
As mentioned before, this is not a "calling convention" but a general problem of computing: concurrency. And the particular case where two or more threads can enter a shared zone at a time, and have a different outcome, is called a race condition (and also extends to/from electronics, and other areas).
The hard thing about threading is that computing is such a deterministic affair, but when threading gets involved, it adds a degree of uncertainty, which vary per platform/OS.
A one-thread affair would guarantee that it can do all tasks in the same order, always, but when you got multiple threads, and the order depends on how fast they can complete a task, shared other applications wanting to use the CPU, then the underlying hardware affects the results.
There's not much of a "sure fire way to do threading", as there's techniques, tools and libraries to deal with individual cases.
Locking in
The most well known technique is using semaphores (or locks), and the most well known semaphore is the mutex one, which only allows one thread at a time to access a shared space, by having a sort of "flag" that is raised once a thread has entered.
if (locked == NO)
{
locked = YES;
// Do ya' thing
locked = NO;
}
The code above, although it looks like it could work, it would not guarantee against cases where both threads pass the if () and then set the variable (which threads can easily do). So there's hardware support for this kind of operation, that guarantees that only one thread can execute it: The testAndSet operation, that checks and then, if available, sets the variable. (Here's the x86 instruction from the instruction set)
On the same vein of locks and semaphores, there's also the read-write lock, that allows multiple readers and one writer, specially useful for things with low volatility. And there's many other variations, some that limit an X amount of threads and whatnot.
But overall, locks are lame, since they are basically forcing serialisation of multi-threading, where threads actually need to get stuck trying to get a lock (or just testing it and leaving). Kinda defeats the purpose of having multiple threads, doesn't it?
The best solution in terms of threading, is to minimise the amount of shared space that threads need to use, possibly, elmininating it completely. Maybe use rwlocks when volatility is low, try to have "try and leave" kind of threads, that check if the lock is up, and then go away if it isn't, etc.
As my OS teacher once said (in Zen-like fashion): "The best kind of locking is the one you can avoid".
Thread Pools
Now, threading is hard, no way around it, that's why there are patterns to deal with such kind of problems, and the Thread Pool Pattern is a popular one, at least in iOS since the introduction of Grand Central Dispatch (GCD).
Instead of having a bunch of threads running amok and getting enqueued all over the place, let's have a set of threads, waiting for tasks in a "pool", and having queues of things to do, ideally, tasks that shouldn't overlap each other.
Now, the thread pattern doesn't solve the problems discussed before, but it changes the paradigm to make it easier to deal with, mentally. Instead of having to think about "threads that need to execute such and such", you just switch the focus to "tasks that need to be executed" and the matter of which thread is doing it, becomes irrelevant.
Again, pools won't solve all your problems, but it will make them easier to understand. And easier to understand may lead to better solutions.
All the theoretical things above mentioned are implemented already, at POSIX level (semaphore.h, pthreads.h, etc. pthreads has a very nice of r/w locking functions), try reading about them.
(Edit: I thought this thread was about Obj-C, not plain C, edited out all the Foundation and GCD stuff)
Calling convention defines how stack & registers are used to implement function calls. Because each thread has its own stack & registers, synchronising threads and calling convention are separate things.
To prevent multiple threads from executing the same code at the same time, you need a mutex. In your example of a function, you'd typically put the mutex lock and unlock inside the function's code, around the statements you don't want your threads to be executing at the same time.
In general terms: Plain code, including function calls, does not know about threads, the operating system does. By using a mutex you tap into the system that manages the running of threads. More details are just a Google search away.
Note that C11, the new C standard revision, does include multi-threading support. But this does not change the general concept; it simply means that you can use C library functions instead of operating system specific ones.

Making process survive failure in its thread

I'm writing app that has many independant threads. While I'm doing quite low level, dangerous stuff there, threads may fail (SIGSEGV, SIGBUS, SIGFPE) but they should not kill whole process. Is there a way to do it proper way?
Currently I intercept aforementioned signals and in their signal handler then I call pthread_exit(NULL). It seems to work but since pthread_exit is not async-signal-safe function I'm a bit concerned about this solution.
I know that splitting this app into multiple processes would solve the problem but in this case it's not an feasible option.
EDIT: I'm aware of all the Bad Thingsā„¢ that can happen (I'm experienced in low-level system and kernel programming) due to ignoring SIGSEGV/SIGBUS/SIGFPE, so please try to answer my particular question instead of giving me lessons about reliability.
The PROPER way to do this is to let the whole process die, and start another one. You don't explain WHY this isn't appropriate, but in essence, that's the only way that is completely safe against various nasty corner cases (which may or may not apply in your situation).
I'm not aware of any method that is 100% safe that doesn't involve letting the whole process. (Note also that sometimes just the act of continuing from these sort of errors are "undefined behaviour" - it doesn't mean that you are definitely going to fall over, just that it MAY be a problem).
It's of course possible that someone knows of some clever trick that works, but I'm pretty certain that the only 100% guaranteed method is to kill the entire process.
Low-latency code design involves a careful "be aware of the system you run on" type of coding and deployment. That means, for example, that standard IPC mechanisms (say, using SysV msgsnd/msgget to pass messages between processes, or pthread_cond_wait/pthread_cond_signal on the PThreads side) as well as ordinary locking primitives (adaptive mutexes) are to be considered rather slow ... because they involve something that takes thousands of CPU cycles ... namely, context switches.
Instead, use "hot-hot" handoff mechanisms such as the disruptor pattern - both producers as well as consumers spin in tight loops permanently polling a single or at worst a small number of atomically-updated memory locations that say where the next item-to-be-processed is found and/or to mark a processed item complete. Bind all producers / consumers to separate CPU cores so that they will never context switch.
In this type of usecase, whether you use separate threads (and get the memory sharing implicitly by virtue of all threads sharing the same address space) or separate processes (and get the memory sharing explicitly by using shared memory for the data-to-be-processed as well as the queue mgmt "metadata") makes very little difference because TLBs and data caches are "always hot" (you never context switch).
If your "processors" are unstable and/or have no guaranteed completion time, you need to add a "reaper" mechanism anyway to deal with failed / timed out messages, but such garbage collection mechanisms necessarily introduce jitter (latency spikes). That's because you need a system call to determine whether a specific thread or process has exited, and system call latency is a few micros even in best case.
From my point of view, you're trying to mix oil and water here; you're required to use library code not specifically written for use in low-latency deployments / library code not under your control, combined with the requirement to do message dispatch with nanosec latencies. There is no way to make e.g. pthread_cond_signal() give you nsec latency because it must do a system call to wake the target up, and that takes longer.
If your "handler code" relies on the "rich" environment, and a huge amount of "state" is shared between these and the main program ... it sounds a bit like saying "I need to make a steam-driven airplane break the sound barrier"...

GLib handle out of memory

I've a question concerning the GLib.
I would like to use the GLib in a server context but I'm not aware on how the memory is managed:
https://developer.gnome.org/glib/stable/glib-Memory-Allocation.html
If any call to allocate memory fails, the application is terminated. This also means that there is no need to check if the call succeeded.
If I look at the source code, if g_malloc failed, it will call g_error:
g_error()
define g_error(...)
A convenience function/macro to log an error message.
Error messages are always fatal, resulting in a call to abort() to terminate the application.[...]
But in my case, as I'm developing a server application, I don't want the application exit, I would prefer, as the traditional malloc function, the GLib functions returns NULL or something to indicate an error happened.
So, my question is, there is a manner to handle out of memory?
Is the GLib not recommended for server purpose applications?
If I look at the man of abort I can see that I can handle the signal but I'll make the management of out-of-memory errors a little bit painful...
The abort() function causes abnormal program termination to occur, unless
the signal SIGABRT is being caught and the signal handler does not
return.
Thanks for you help!
It's very difficult to recover from lack of memory. The reason for that is that it can be considered a terminal state, in the sense that lack of memory will persist for some time before it goes away. Even reacting to the lack of memory (like informing the user) might require more memory, for example, to build and send a message. A related problem is that there are operating systems (linux at least) that may be over optimistic about allocating memory. When the kernel realizes that memory is missing, it may kill the application, even if your code is handling the failures.
So either you have a much stricter grasp of your whole system than average, or you won't be able to successfully handle out of memory errors, and, in this case, it doesn't matter what the helper library is doing.
If you really want to control memory allocation while still using glib, you have partial ways to do that. Don't use any glib allocation function and use some from other library. Glib provides functions that receive a "free function" when necessary. For example:
https://developer.gnome.org/glib/2.31/glib-Hash-Tables.html#g-hash-table-new-full
The hash table constructor accepts functions for destroying both keys and values. In your case, the data will be allocated using custom allocation functions, while the hash data structures will be allocated with glib functions.
Alternatively you could use g_try_* macros to allocate memory, so you still use glib allocator, but it won't abort on error. Again, this only partially solves the problem. Internally, glib will implicitly call functions that may abort and it assumes it will never return on error.
About the general question: does it make sense for a server to crash when it's out of memory ? The obvious answer is no, but I can't estimate how theoretical this answer is. I can only expect that the server system be properly sized for its operation and reject as invalid any input that could potentially exceed its capacities, and for that, it doesn't matter which libraries it might use.
I'm probably editorializing a bit here, but the modern tendency to use virtual/logical memory (both names have been used, although "logical" is more distinct) does dramatically complicate knowing when memory is exhausted, although I think one can restore the old, real-(RAM + swap) model (I'll call this the physical model) in Linux with the following in /etc/sysctl.d/10-no-overcommit.conf:
vm.overcommit_memory = 2
vm.overcommit_ratio = 100
This restores the ability to have the philosophy that if a program's malloc just failed, that program has a good chance of having been the actual cause of memory exhaustion, and can then back away from the construction of the current object, freeing memory along the way, possibly grumbling at the user for having asked for something crazy that needed too much RAM, and awaiting the next request. In this model, most OOM conditions resolve almost instantly - the program either copes and presumably returns RAM, or gets killed immediately on the following SEGV when it tries to use the 0 returned by malloc.
With the virtual/logical memory models that linux tends to default to in 2013, this doesn't work, since a program won't find memory isn't available at malloc, but instead upon attempting to access memory later at which point the kernel finally realizes there's nowhere in RAM for it. This amounts to disaster, since any program on the system can die, rather than the one the ran the host out of RAM. One can understand why some GLib folks don't even care about trying to fix this problem, because with the logical memory model, it can't be fixed.
The original point of logical memory was to allow huge programs using more than half the memory of the host to still be able to fork and exec supporting programs. It was typically enabled only on hosts with that particular usage pattern. Now in 2013 when a home workstation can have 24+ GiB of RAM, there's really no excuse to have logical memory enabled at all 99% of the time. It should probably be disabled by default on hosts with >4 GiB of RAM at boot.
Anyway. So if you want to take the old-school physical model approach, make sure your computer has it enabled, or there's no point to testing your malloc and realloc calls.
If you are in that model, remember that GLib wasn't really guided by the same philosophy (see http://code.google.com/p/chromium/issues/detail?id=51286#c27 for just how madly astray some of them are). Any library based on GLib might be infected with the same attitude as well. However, there may be some interesting things one can do with GLib in the physical memory model by emplacing your own memory handlers with g_mem_set_vtable(), since you might be able to poke around in program globals and reduce usage in a cache or the like to free up space, then retry the underlying malloc. However, that's limited in its own way by not knowing which object was under construction at the point your special handler is invoked.

Real-life use cases of barriers (DSB, DMB, ISB) in ARM

I understand that DSB, DMB, and ISB are barriers for prevent reordering of instructions.
I also can find lots of very good explanations for each of them, but it is pretty hard to imagine the case that I have to use them.
Also, from the open source codes, I see those barriers from time to time, but it is quite hard to understand why they are used. Just for an example, in Linux kernel 3.7 tcp_rcv_synsent_state_process function, there is a line as follows:
if (unlikely(po->origdev))
sll->sll_ifindex = orig_dev->ifindex;
else
sll->sll_ifindex = dev->ifindex;
smp_mb();
if (po->tp_version <= TPACKET_V2)
__packet_set_status(po, h.raw, status);
where smp_mb() is basically DMB.
Could you give me some of your real-life examples?
It would help understand more about barriers.
Sorry, not going to give you a straight-out example like you're asking, because as you are already looking through the Linux source code, you have plenty of those to go around, and they don't appear to help. No shame in that - every sane person is at least initially confused by memory access ordering issues :)
If you are mainly an application developer, then there is every chance you won't need to worry too much about it - whatever concurrency frameworks you use will resolve it for you.
If you are mainly a device driver developer, then examples are fairly straightforward to find - whenever there is a dependency in your code on a previous access having had an effect (cleared an interrupt source, written a DMA descriptor) before some other access is performed (re-enabling interrupts, initiating the DMA transaction).
If you are in the process of developing a concurrency framework (, or debugging one), you probably need to read up on the topic a bit more - but your question suggests a superficial curiosity rather than an immediate need?
If you are developing your own method for passing data between threads, not based on primitives provided by a concurrency framework, that is for all intents and purposes a concurrency framework.
Paul McKenney wrote an excellent paper on the need for memory barriers, and what effects they actually have in the processor: Memory Barriers: a Hardware View for Software Hackers
If that's a bit too hardcore, I wrote a 3-part blog series that's a bit more lightweight, and finishes off with an ARM-specific view. First part is Memory access ordering - an introduction.
But if it is specifically lists of examples you are after, especially for the ARM architecture, you could do a lot worse than Barrier Litmus Tests and Cookbook.
The extra-extra light programmer's view and not entirely architecturally correct version is:
DMB - whenever a memory access requires ordering with regards to another memory access.
DSB - whenever a memory access needs to have completed before program execution progresses.
ISB - whenever instruction fetches need to explicitly take place after a certain point in the program, for example after memory map updates or after writing code to be executed. (In practice, this means "throw away any prefetched instructions at this point".)
Usually you need to use a memory barrier in cases where you have to make SURE that memory access occurs in a specific order. This might be required for a number of reasons, usually it's required when two or more processes/threads or a hardware component access the same memory structure, which has to be kept consistent.
It's used very often in DMA-transfers. A simple DMA control structures might look like this:
struct dma_control {
u32 owner;
void * data;
u32 len;
};
The owner will usually be set to something like OWNER_CPU or OWNER_HARDWARE, to indicate who of the two participants is allowed to work with the structure.
Code which changes this will usually like like this
dma->data = data;
dma->len = length;
smp_mb();
dma->owner = OWNER_HARDWARE;
So, data an len are always set before the ownership gets transfered to the DMA hardware. Otherwise the engine might get stale data, like a pointer or length which was not updated, because the CPU reordered the memory access.
The same goes for processes or threads running on different cores. The could communicate in a similar manner.
One simple example of a barrier requirement is a spinlock. If you implement a spinlock using compare-and-swap(or LDREX/STREX on ARM) and without a barrier, the processor is allowed to speculatively load values from memory and lazily store computed values to memory, and neither of those are required to happen in the order of the loads/stores in the instruction stream.
The DMB in particular prevents memory access reordering around the DMB. Without DMB, the processor could reorder a store to memory protected by the spinlock after the spinlock is released. Or the processor could read memory protected by the spinlock before the spinlock was actually locked, or while it was locked by a different context.
unixsmurf already pointed it out, but I'll also point you toward Barrier Litmus Tests and Cookbook. It has some pretty good examples of where and why you should use barriers.

Reader-Writer using semaphores and shared memory in C

I'm trying to make a simple reader/writer program using POSIX named semaphores, its working, but on some systems, it halts immediately on the first semaphore and thats it ... I'm really desperate by now. Can anyone help please? Its working fine on my system, so i can't track the problem by ltrace. (sorry for the comments, I'm from czech republic)
https://www.dropbox.com/s/hfcp44u2r0jd7fy/readerWriter.c
POSIX semaphores are not well suited for application code since they are interruptible. Basically any sort of IO to your processes will mess up your signalling. Please have a look at this post.
So you'd have to be really careful to interpret all error returns from the sem_ functions properly. In the code that you posted there is no such thing.
If your implementation of POSIX supports them, just use rwlocks, they are made for this, are much higher level and don't encounter that difficulty.
In computer science, the readers-writers problems are examples of a common computing problem in concurrency. There are at least three variations of the problems, which deal with situations in which many threads try to access the same shared memory at one time. Some threads may read and some may write, with the constraint that no process may access the share for either reading or writing, while another process is in the act of writing to it. (In particular, it is allowed for two or more readers to access the share at the same time.) A readers-writer lock is a data structure that solves one or more of the readers-writers problems.

Resources