GCC Atomic Builtins instead of pthread? - c

I've found following article: Use GCC-provided atomic lock operations to replace pthread_mutex_lock functions
It refers to GCC Atomic Builtins.
What the article suggests is to use GCC atomic builtins instead of the pthread synchronization tools.
Is this a good idea?
PS. The MySQL post is obviously misleading. Atomic builtins can't replace all pthread tools. For example, locking requires that a thread wait if the lock can't be acquired; in other words, it asks the OS to put the thread to sleep, so the wait is passive. A simple GCC builtin can't do that.

Is this a good idea?
Not if you ever intend to compile the code with something other than gcc. Is pthreads causing you any specific problems?

If you are already using pthread, and the pthread lock functions already do what you want, it is best to use the pthread lock functions.
These atomic builtins are just the building blocks for higher-level primitives; writing these higher-level primitives tends to be tricky, and any mistakes can cause errors which can take a long time to show up (since they usually depend on timing). If you already have a library with higher-level primitives which do what you want and are fast enough for your needs (and do not assume they are too slow just because you have to do a function call), it is best to not reinvent the wheel.

No. The GCC built-ins probably make good sense to the guys who write operating systems, libc, and maybe pthreads itself, but for your average application there is no reason not to use the pthreads approach.
And even if you always use GCC, some day you may want to run a static analysis tool which won't handle all the custom GCC extensions.

C Extensions
http://gcc.gnu.org/ml/libstdc++/2006-06/msg00089.html
Atomic Builtins
http://sources.redhat.com/ml/libc-alpha/2005-06/msg00132.html
http://www.redi.uklinux.net/doc/c++/shared_ptr.html

Atomic builtins make sense if you want to improve performance. Builtins allow you to minimize contention caused by the serialization of mutexes. When you use a mutex to create a critical section, you serialize access to that section of your code; in performance-critical code you may want to avoid contention by using thread-specific data where possible and atomics where not. The last resort is locking, and when locking, minimize the time the lock is held (using messaging and double-checked locking, though some claim the latter doesn't work -- it works for me).
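As a rough illustration of the difference, here is a minimal sketch (assuming GCC's legacy __sync builtins; the function names are made up) of the same shared counter updated under a mutex and with an atomic builtin. The atomic version is a single read-modify-write with no possibility of blocking, but it only works for operations this simple.

#include <pthread.h>

static long counter;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

void bump_with_mutex(void)
{
    pthread_mutex_lock(&counter_lock);   /* contending threads may sleep here */
    counter++;
    pthread_mutex_unlock(&counter_lock);
}

void bump_with_builtin(void)
{
    __sync_fetch_and_add(&counter, 1);   /* single atomic read-modify-write, no blocking */
}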

Related

Disable __thread support

I'm implementing a pthread replacement library which is extremely lightweight. There are several reasons why I want to disable __thread completely.
It's a waste of memory. If I'm creating a thousand threads which have nothing to do with the context that declares a variable with __thread, the program will still have allocated 1000 times the size of that data in bytes and will never use it. It's simply not compatible with the memory demands of the mass-concurrency model. If we need extremely lightweight fibers with only 8K of stack, a TLS block of just 4K would be an overhead of 50% of the memory used by each thread. In some scenarios the TLS overhead would be enormous.
TLS is a complex standard and I simply don't have the time/resources to support it. It's too expensive. Personally I think the standard is poorly designed. It should have defined standard functions that had to be provided by the linker, so thread libraries can take control over where TLS allocation takes place and insert the relevant offsets and addresses they require. Also, the standard ELF implementation has been infected with pthread, expecting pthread-sized structs to calculate offsets, which makes it really hard to adapt to anything else.
It's just a bad pattern. It encourages using globals and writing functions with hidden static state/side effects. This is not a territory we want to be in if we're creating correct programs that are easy to analyze.
If we really need "thread context" for some magic that tracks thread state behind the scenes (like for allocation or cancellation tracking), why not just expose the magic that TLS uses to find that context in the first place? Personally I'm just going to use the %fs register directly. That would not be possible in libraries for obvious reasons, but why should they be thread-aware to begin with? Why not just design them correctly so they get the context-related data they need right in the argument list?
My question is simply: What is the simplest way to disable __thread support and make clang emit errors if you accidentally used it? How can I get errors if I load a dynamic library which happens to require TLS?
I believe the simplest way is adding something like this unconditionally to your CFLAGS (perhaps from clang's equivalent of the gcc specfile if you want it to be system-global):
-D__thread='^-^'
where the righthand side can be anything that's syntactically invalid (a constraint violation) at any point in a C program.
As for preventing loading of libraries with TLS, you'd have to patch the linker and/or dynamic linker to reject them. If you're just talking about dlopen, your program could first read the file and parse the ELF headers for TLS relocations, then reject the library (without passing it to dlopen) if it has any. This might even be possible with an LD_PRELOAD wrapper.
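A rough sketch of that pre-dlopen check (64-bit ELF only, minimal error handling; the has_tls name is made up here) would read the program headers and look for a PT_TLS segment:

#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if the shared object declares a TLS segment, 0 if not, -1 on error.
   Assumes e_phentsize matches sizeof(Elf64_Phdr), which it normally does. */
static int has_tls(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0) {
        fclose(f);
        return -1;
    }

    int found = 0;
    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        if (fseek(f, (long)(eh.e_phoff + i * eh.e_phentsize), SEEK_SET) != 0 ||
            fread(&ph, sizeof ph, 1, f) != 1)
            break;
        if (ph.p_type == PT_TLS) {     /* library requests a TLS block */
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}

Your dlopen wrapper would call has_tls() first and refuse to pass the path on if it returns 1.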
I agree with you that, especially in its current implementation, TLS is something whose use should generally be avoided, but may I ask if you've measured the costs? I think stamping it out completely will be fairly difficult on a system that's designed to use it, and there's much lower-hanging fruit for cutting bloat. Which libc are you using? If it's glibc, I'm pretty sure it uses lots of TLS internally these days... Of course, if you're writing your own threads implementation, that's going to entail a lot of interaction with the rest of the standard library, so perhaps you're already patching it out...?
By the way (shameless plug follows), we have an extremely light-weight threads implementation in musl libc which presently does not have TLS. I don't think it would be easy to integrate with another libc (and I'm sure if you're writing your own you will find it difficult to integrate with glibc, especially glibc's dynamic linker, which expects TLS to be supported) but if you can use the whole library as-is, it may meet your needs for specific projects, or have useful code you can borrow (license is MIT).

How can I replay a multithreaded application?

I want to record synchronization operations, such as locks, semaphores, and barriers, of a multithreaded application, so that I can replay the recorded application later on for the purpose of debugging.
One way is to supply your own lock, semaphore, condition variable, etc. functions that also do logging, but I think that is overkill, because underneath they must all be using some common synchronization operations.
So my question is which synchronization operations I should log so that I require minimum modifications to my program. In other words, which are the functions or macros in glibc and system calls over which all these synchronization operations are built? So that I only modify those for logging and replaying.
The best I can think of is debugging with gdb in 'record' mode:
Gdb process record/replay execution log
According to this page (GDB Process Record), threading support is underway, but it might not be complete yet.
Less strictly answering your question, may I suggest
Helgrind
DRD
On other platforms, several other threading checkers exist, but I haven't got much experience with them.
In your case, an effective method of "logging" system calls on Linux may be to use the LD_PRELOAD trick and override the actual calls with your own versions that log the use of the call and then forward to the actual call.
A more extensive example is given here in Linux Journal.
As you can see at these links, the basic gist of the "trick" is that you can make the system load your own dynamic library before any other system libraries, such as pthreads, and then mask calls to those library functions by having your own versions take precedence. Inside your overriding function, you can then log the use of the original function as well as pass the arguments on to the actual call you're attempting to log.
The nice thing about this method is that it will catch pretty much any call you can make, both functions that remain entirely in user-land and functions that make a kernel call.
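A minimal sketch of that interposition, assuming glibc/Linux and a made-up library name liblocklog.so, might log every pthread_mutex_lock call and forward it to the real implementation:

/* Build: gcc -shared -fPIC -o liblocklog.so locklog.c -ldl
   Run:   LD_PRELOAD=./liblocklog.so ./your_program          */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <stdio.h>

int pthread_mutex_lock(pthread_mutex_t *m)
{
    /* Look up the "next" definition of the symbol -- the real one. */
    static int (*real_lock)(pthread_mutex_t *);
    if (!real_lock)
        real_lock = (int (*)(pthread_mutex_t *))dlsym(RTLD_NEXT, "pthread_mutex_lock");

    fprintf(stderr, "[log] pthread_mutex_lock(%p)\n", (void *)m);  /* record the event */
    return real_lock(m);                                           /* forward to libpthread */
}

You would add a similar wrapper for each call you want to log (unlock, sem_wait, pthread_barrier_wait, and so on).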
GDB record mode doesn't support multithreading, but the rr record/replay system absolutely does: https://rr-project.org/ .
For a commercial solution with fewer technical restrictions, there's also UDB: https://undo.io/solutions/ .
I've worked on debuggers for some years now, and from what I've seen, the GDB record+replay stuff is really not ready for primetime, for this and other reasons (e.g., slowdown and huge memory requirements).
If you can get it to work in your dev environment, record+replay/reversible debugging can be pretty game-changing for your workflow; I hope you find a way to leverage it.

Task library for C?

Is there a task library for C? I'm talking about the parallel task library as it exists in C#, or Java. In other words, I need a layer of abstraction over pthread for Linux. Thanks.
Have a look at OpenMP.
In particular, you might be interested in the Task feature of OpenMP 3.0.
I suggest, however, that you first check whether your problem can be solved using other, more "basic" constructs, such as parallel for, since they are simpler to use.
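As a minimal sketch of the task feature mentioned above (assuming a compiler with OpenMP 3.0 support, compiled with e.g. gcc -fopenmp), each recursive call becomes a task:

#include <stdio.h>

static long fib(int n)
{
    if (n < 2)
        return n;

    long a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait            /* wait for both child tasks */
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single              /* one thread creates the root task */
    r = fib(30);
    printf("%ld\n", r);
    return 0;
}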
Probably the most widely-used parallel programming primitives aside from the Win32 ones are those provided by pthreads.
They are quite low-level but include everything you need to write an efficient blocking queue and so create a thread pool of workers that carry out a queue of asynchronous tasks.
There is also a Win32 implementation so you can use the same codebase on Windows as well as POSIX systems.
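A bare-bones sketch of the blocking queue such a thread pool sits on, built only from pthread primitives (fixed capacity, no shutdown handling, illustrative names):

#include <pthread.h>

#define QCAP 64

struct task_queue {
    void *items[QCAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

/* An instance; the mutex and condition variables use the static initializers. */
static struct task_queue queue = {
    .lock      = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full  = PTHREAD_COND_INITIALIZER,
};

void queue_push(struct task_queue *q, void *item)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)                      /* wait for space */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

void *queue_pop(struct task_queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)                         /* wait for work */
        pthread_cond_wait(&q->not_empty, &q->lock);
    void *item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

Worker threads loop on queue_pop(&queue) and execute whatever task object is returned.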
Many concepts in the TPL (Task, work-stealing scheduler, ...) are inspired by a very successful MIT project named Cilk. Its advanced framework (Cilk Plus) was acquired by Intel and integrated into Intel Parallel Building Blocks. You can still use Cilk as an open-source project without some of the advanced features. The good news is that Intel is releasing Cilk Plus as open source in GCC.
You should try out Cilk, as it adds another layer of abstraction to C which makes it easy to express parallel algorithms, yet is close enough to C to ensure good performance.
I've been meaning to check out libdispatch. Yes, it's built for OS X and blocks, but it has function interfaces as well. I haven't really had time to look at it yet, though, so I'm not sure whether it fills all your needs.
There is an academic project called Wool that implements a work-stealing scheduler in C (with significant help from the C preprocessor, AFAIK). It might be worth looking at, though it does not seem actively developed.

Mixing OpenMP with pthreads

My question is whether it is a good idea to mix OpenMP with pthreads. Are there applications out there which combine the two? Is it good practice to mix them, or do typical applications normally just use one or the other?
Typically it's better to just use one or the other. But for myself at least, I do regularly mix the two and it's safe if it's done correctly.
The most common case I do this is where I have a lower-level library that is threaded using pthreads, but I'm calling it in a user application that uses OpenMP.
There are some cases where it isn't safe. If for example, you kill a pthread before you exit all OpenMP regions in that thread.
I don't think so.
It's not a good idea. The thing is, OpenMP is basically made for portability; if you are also using pthreads, you are losing the very essence of it!
pthreads is only supported by POSIX-compliant OSes, while OpenMP can be used on virtually any OS provided it has support for it.
Anyway, OpenMP gives you a much higher level of abstraction than pthreads does.
No problem.
The purposes of OpenMP and pthreads are different. OpenMP is perfect for writing loop-level parallelism. However, OpenMP is not adequate for expressing sophisticated thread communication and synchronization; it does not support all kinds of synchronization, such as condition variables.
The caveat would be, as Mystrical pointed out, handling and accessing native threads within OpenMP parallel constructs.
FYI, Intel's TBB and Cilk Plus are also often used in a mixed way.
On Windows and Linux it seems to work just fine. However, OpenMP does not work on a Mac if it is run in a new thread. It only works in the main thread.
It appears that the behavior of mixing the two threading models is not defined. Some platform/compiler combinations support it; others do not.
Sure. I do it all the time. You have to be careful. Why do it, though? Because there are some instances in which you have to! In complicated tasking models, such as pipelined functions where you want to keep the pipe going, it may be the only way to take advantage of all the power available.
I find it very hard to believe that you would need pthreads if you already use OpenMP. You can use a sections pragma to run different functions in parallel; I have personally used it to implement pipeline parallelism, as in the sketch below.
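A minimal sketch of that sections pattern (compiled with -fopenmp; the produce/consume functions are just placeholders):

#include <stdio.h>
#include <omp.h>

static void produce(void) { printf("producer on thread %d\n", omp_get_thread_num()); }
static void consume(void) { printf("consumer on thread %d\n", omp_get_thread_num()); }

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        produce();                  /* one thread runs this section */
        #pragma omp section
        consume();                  /* another thread runs this one */
    }
    return 0;
}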
Nowadays OpenMP does much more than pthreads, so if you use OpenMP you are covered. For instance, GCC 5.0 onward implements OpenMP extensions that can offload code to GPUs. :D

Portable thread-safety in C?

Purpose
I'm writing a small library for which portability is the biggest concern. It has been designed to assume only a mostly-compliant C90 (ISO/IEC 9899:1990) environment... nothing more. The set of functions provided by the library all operate (read/write) on an internal data structure. I've considered some other design alternatives, but nothing else seems feasible for what the library is trying to achieve.
Question
Are there any portable algorithms, techniques, or incantations which can be used to ensure thread-safety? I am not concerned with making the functions re-entrant. Moreover, I am not concerned with speed or (possibly) wasting resources if the algorithm/technique/incantation is portable. Ideally, I don't want to depend on any libraries (such as GNU Pth) or system-specific operations (like atomic test-and-set).
I have considered modifying Lamport's bakery algorithm, but I do not know how to alter it to work inside of the functions called by the threads instead of working in the threads themselves.
Any help is greatly appreciated.
Without OS/hardware support, at least an atomic CAS, there's nothing you can do that's practical. There are portable libraries that abstract various platforms into a common interface, though.
http://www.gnu.org/software/pth/related.html
Almost all systems (even Windows) can run libpthread these days.
Lamport's bakery algorithm would probably work; unfortunately, there are still practical problems with it. Specifically, many CPUs implement out-of-order memory operations: even if you've compiled your code into a perfectly correct instruction sequence, the CPU, when executing your code, may decide to reorder the instructions on the fly to achieve better performance. The only way to get around this is to use memory barriers, which are highly system- and CPU-specific.
You really only have two choices here: either (1) keep your library thread-unsafe and make your users aware of that in the documentation, or (2) use a platform-specific mutex. Option 2 can be made easier by using another library that implements mutexes for a large variety of platforms and provides you with a unified, abstract interface.
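If you go with option 2, a tiny internal wrapper (a sketch assuming only pthreads and Win32 as targets; the lib_mutex names are made up) keeps the platform-specific part confined to one place:

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION lib_mutex;
static void lib_mutex_init(lib_mutex *m)   { InitializeCriticalSection(m); }
static void lib_mutex_lock(lib_mutex *m)   { EnterCriticalSection(m); }
static void lib_mutex_unlock(lib_mutex *m) { LeaveCriticalSection(m); }
#else
#include <pthread.h>
typedef pthread_mutex_t lib_mutex;
static void lib_mutex_init(lib_mutex *m)   { pthread_mutex_init(m, NULL); }
static void lib_mutex_lock(lib_mutex *m)   { pthread_mutex_lock(m); }
static void lib_mutex_unlock(lib_mutex *m) { pthread_mutex_unlock(m); }
#endif

The rest of the library only ever calls lib_mutex_*, so porting to a new platform means touching this one block.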
Functions either cannot be thread-safe or are innately thread-safe, depending on how you want to look at it. And threading/locking is innately platform-specific. Really, it is up to the users of your library to handle the threading issues.
