Lock-free ping-pong in C11

I'm very new to concurrency in C and trying to do some basic stuff to understand how it works.
I wanted to write a conforming implementation of lock-free ping-pong, i.e. one thread prints ping, then another thread prints pong, and so on, without locks. Here is my attempt:
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

#if ATOMIC_INT_LOCK_FREE != 2
#error atomic int should be always lock-free
#else
static _Atomic int flag;
#endif

static void *ping(void *ignored){
    while(1){
        int val = atomic_load_explicit(&flag, memory_order_acquire);
        if(val){
            printf("ping\n");
            atomic_store_explicit(&flag, !val, memory_order_release);
        }
    }
    return NULL;
}

static void *pong(void *ignored){
    while(1){
        int val = atomic_load_explicit(&flag, memory_order_acquire);
        if(!val){
            printf("pong\n");
            atomic_store_explicit(&flag, !val, memory_order_release);
        }
    }
    return NULL;
}

int main(int argc, const char *argv[]){
    pthread_t pthread_ping;
    pthread_create(&pthread_ping, NULL, &ping, NULL);
    pthread_t pthread_pong;
    pthread_create(&pthread_pong, NULL, &pong, NULL);
    pthread_join(pthread_ping, NULL);   /* keep main alive; returning would terminate both threads */
    pthread_join(pthread_pong, NULL);
}
I tested it a few times and it worked, but there are things that seem weird:
It is either lock-free or it does not compile
The Standard defines the *_LOCK_FREE macro value 2 to mean that all operations on the corresponding atomic type are always lock-free. In particular, I checked the compiled code and it looks like this:
sub $0x8,%rsp
nopl 0x0(%rax)
mov 0x20104e(%rip),%eax # 0x20202c <flag>
test %eax,%eax
je 0xfd8 <ping+8>
lea 0xd0(%rip),%rdi # 0x10b9
callq 0xbc0 <puts@plt>
movl $0x0,0x201034(%rip) # 0x20202c <flag>
jmp 0xfd8 <ping+8>
This seems OK, and we don't even need any sort of fence, since Intel CPUs do not allow stores to be reordered with earlier loads. But such assumptions work only if we know the hardware memory model, which is not portable.
Using stdatomics with pthreads
I'm stuck with glibc 2.27, where threads.h is not yet implemented, so I use pthreads instead. The question is whether it is strictly conforming to do so. Anyway, it is sort of strange to have atomics but not threads. What is the conforming usage of stdatomics in a multithreaded application then?

There are 2 meanings to the term lock-free:
the computer science meaning: one thread getting stuck can't impede the others. This task is impossible to make lock-free, you need the threads to wait for each other. (https://en.wikipedia.org/wiki/Non-blocking_algorithm)
using lockless atomics. You're basically creating your own mechanism for making a thread block, waiting in a nasty spin-loop with no fallback to give up the CPU eventually.
The individual stdatomic load and store operations are each separately lock-free, but you're using them to create sort of a 2-thread lock.
Your attempt looks correct to me. I don't see a way a thread can "miss" an update, because the other thread won't write another one until after this one finishes. And I don't see a way for both threads to be inside their critical sections at once.
A more interesting test would be using unlocked stdio operations, like
fputs_unlocked("ping\n", stdout); to take advantage of (and depend on) the fact that you've already guaranteed mutual exclusion between threads. See unlocked_stdio(3).
And test with output redirected to a file, so stdio is full buffered instead of line-buffered. (A system call like write() is fully serializing anyway, like atomic_thread_fence(mo_seq_cst).)
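For instance, the ping side might become something like the sketch below (fputs_unlocked is a GNU extension, so this assumes glibc, and it is only safe because the flag handshake already guarantees the two threads never touch stdout at the same time):

#define _GNU_SOURCE           /* for fputs_unlocked */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic int flag;

static void *ping(void *ignored){
    while(1){
        if(atomic_load_explicit(&flag, memory_order_acquire)){
            fputs_unlocked("ping\n", stdout);  /* skip the per-call stdio lock */
            atomic_store_explicit(&flag, 0, memory_order_release);
        }
    }
    return NULL;
}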
It is either lock-free or it does not compile
Ok, why is that weird? You chose to do that. It's not necessary; the algorithm would still work on C implementations without always-lock-free atomic_int.
atomic_bool might be a better choice, being lock-free on more platforms including 8-bit platforms where int takes 2 registers (because it has to be at least 16-bit). Implementations are free to make atomic_bool a 4-byte type on platforms where that's more efficient, but IDK if any actually do. (On some non-x86 platforms, byte loads / stores cost an extra cycle of latency to read/write in cache. Negligible here because you're always dealing with the inter-core cache miss case.)
You'd think atomic_flag would be the right choice for this, but it only provides test-and-set, and clear, as RMW operations. Not plain load or store.
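For reference, what atomic_flag is designed for is a minimal spinlock built from exactly those two operations, something like this sketch (the names here are mine):

#include <stdatomic.h>

static atomic_flag spin = ATOMIC_FLAG_INIT;

static void spin_lock(void){
    /* test-and-set returns the previous value: loop until we observed it clear */
    while(atomic_flag_test_and_set_explicit(&spin, memory_order_acquire))
        ;  /* busy-wait */
}

static void spin_unlock(void){
    atomic_flag_clear_explicit(&spin, memory_order_release);
}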
Such assumptions work only if we know the hardware memory model, which is not portable
Yes, but this no-barriers asm code gen only happens while compiling for x86. Compilers can and should apply the as-if rule to create asm that runs on the compile target as if the C source was running on the C abstract machine.
Using stdatomics with pthreads
Does the ISO C Standard guarantee that atomics' behavior is well-defined with all threading implementations (like pthreads, the earlier LinuxThreads, etc.)?
No, ISO C has nothing to say about language extensions like POSIX.
It does say in a footnote (not normative) that lockless atomics should be address-free so they work between different processes accessing the same shared memory. (Or maybe this footnote is only in ISO C++, I didn't go and re-check).
That's the only case I can think of ISO C or C++ trying to prescribe behaviour for extensions.
But the POSIX standard hopefully says something about stdatomic! That's where you should look; it extends ISO C, not the other way around, so pthreads is the standard that would have to specify that its threads work like C11 threads.h threads and that atomics work.
In practice of course, stdatomic is 100% fine with any threading implementation where all threads share the same virtual address space. This includes non-lock-free things like _Atomic my_large_struct foo;.
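For example, something like the following is valid C11 even though the loads and stores on the struct will almost certainly be implemented with a hidden lock rather than being lock-free (the struct itself is just an illustration):

#include <stdatomic.h>
#include <stdbool.h>

struct big { int a, b, c, d, e, f; };

static _Atomic struct big shared;

void writer(void){
    struct big tmp = { 1, 2, 3, 4, 5, 6 };
    atomic_store(&shared, tmp);            /* whole-struct atomic store */
}

struct big reader(void){
    return atomic_load(&shared);           /* whole-struct atomic load */
}

bool is_lock_free(void){
    return atomic_is_lock_free(&shared);   /* almost certainly false */
}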

Related

Is `int` an atomic type?

Quoting the GNU C Library manual:
In practice, you can assume that int is atomic. You can also assume that pointer types are atomic; that is very convenient. Both of these assumptions are true on all of the machines that the GNU C Library supports and on all POSIX systems we know of.
How is this possible? All examples I've seen related to locks are made with an int counter, like this https://www.delftstack.com/howto/c/mutex-in-c/.
This text in the glibc manual is only relevant for things like volatile sig_atomic_t being safe between a thread and its signal handler. They're guaranteeing that volatile int will work that way, too, in GNU systems. Note the context you omitted:
To avoid uncertainty about interrupting access to a variable, you can use a particular data type for which access is always atomic: sig_atomic_t. Reading and writing this data type is guaranteed to happen in a single instruction, so there’s no way for a handler to run “in the middle” of an access.
This says nothing about atomicity wrt. other threads or concurrency. An interrupt handler takes over the CPU and runs instead of your code. Asynchronously at any point, but the main thread and its signal handler aren't both running at once (especially not on different CPU cores).
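The pattern that manual text is actually describing looks like this (a plain standard-C sketch: one thread plus its own signal handler):

#include <signal.h>
#include <stdio.h>

static volatile sig_atomic_t got_signal = 0;

static void handler(int sig){
    (void)sig;
    got_signal = 1;           /* a single, uninterruptible write */
}

int main(void){
    signal(SIGINT, handler);
    while(!got_signal)
        ;                     /* the handler interrupts this same thread */
    puts("got SIGINT");
}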
For threading, atomicity isn't sufficient, you also need a guarantee of visibility between threads (which you can get from _Atomic int with memory_order_relaxed), and sometimes ordering which you can get with stronger memory orders.
See the early parts of Why is integer assignment on a naturally aligned variable atomic on x86? for more discussion of why a C type having a width that is naturally atomic on the target you're compiling for is not sufficient for much of anything.
You sometimes also need RMW atomicity, like the ability to do an atomic_fetch_add such that if 1000 such +=1 operations happen across multiple threads, the total result will be like +=1000. For that you absolutely need compiler support (or inline asm), like C11 _Atomic int. Can num++ be atomic for 'int num'?
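A sketch of that RMW case with C11 atomics and pthreads (thread count and iteration count are arbitrary):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int counter;

static void *adder(void *arg){
    (void)arg;
    for(int i = 0; i < 1000; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

int main(void){
    pthread_t t[4];
    for(int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, adder, NULL);
    for(int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("%d\n", atomic_load(&counter));  /* always 4000, never less */
}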
The guarantee that int is atomic means atomic_int should always be lock-free, and cheap, but it absolutely does not mean plain int is remotely safe for data shared between threads. That's data-race UB, and even if you use memory barriers like GNU C asm("" ::: "memory") to try to get the compiler to not keep the value in a register, your code can break in lots of interesting ways that are more subtle than some of the obvious breakage mechanisms. See Who's afraid of a big bad optimizing compiler? on LWN for some caveats about doing that in the Linux kernel, where they use volatile for atomicity.
(Fun fact: GNU C at least de-facto gives pure-load and pure-store atomicity with volatile int64_t on 64-bit machines, unlike with plain int64_t. ISO C doesn't guarantee even that for volatile, which is why the Linux kernel depends on being compiled with GCC or maybe clang.)

Do I need to write explicit memory barrier for multithreaded C code?

I'm writing some code on Linux using the pthread multithreading library, and I'm currently wondering if the following code is safe when compiled with -Ofast -flto -pthread.
// shared global
long shared_event_count = 0;
// ...
pthread_mutex_lock(mutex);
while (shared_event_count <= *last_seen_event_count)
    pthread_cond_wait(cond, mutex);
*last_seen_event_count = shared_event_count;
pthread_mutex_unlock(mutex);
Are the calls to pthread_* functions enough, or should I also include a memory barrier to make sure that the change to the global variable shared_event_count is actually seen during the loop? Without a memory barrier, the compiler would be free to optimize the variable into a register, right? Of course, I could declare the shared integer as volatile, which would prevent keeping its contents in a register during the loop, but if I used the variable multiple times within the loop, it could make sense to read a fresh value only for the loop condition, because that would allow more compiler optimizations.
From testing the above code as-is, it appears that the generated code actually sees the changes made by another thread. However, is there any spec or documentation that actually guarantees this?
The common solution seems to be "don't optimize multithreaded code too aggressively", but that seems like a poor man's workaround instead of really fixing the issue. I'd rather write correct code and let the compiler optimize as much as possible within the spec (any code that gets broken by optimization is in reality relying on, e.g., undefined behavior of the C standard as if it were stable behavior, except for some rare cases where the compiler actually outputs invalid code, but that seems to be very rare these days).
I'd much prefer writing code that works with any optimizing compiler; as such, it should only use features specified in the C standard and the pthreads library documentation.
I found an interesting article at https://www.alibabacloud.com/blog/597460 which contains a trick like this:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
This was actually used first in Linux kernel and it triggered a compiler bug in old GCC versions: https://lwn.net/Articles/624126/
As such, let's assume that the compiler is actually following the spec and doesn't contain a bug but implements every possible optimization known to man allowed by the specs. Is the above code safe with that assumption?
Also, does pthread_mutex_lock() include memory barrier by the spec or could compiler re-order the statements around it?
The compiler will not reorder memory accesses across pthread_mutex_lock() (this is an oversimplification and not strictly true, see below).
First I’ll justify this by talking about how compilers work, then I’ll justify this by looking at the spec, and then I’ll justify this by talking about convention.
I don’t think I can give you a perfect justification from the spec. In general, I would not expect a spec to give you a perfect justification—it’s turtles all the way down (do you have a spec for how to interpret the spec?), and the spec is designed to be read and understood by actual humans who understand the relevant background concepts.
How This Works
How this works—the compiler, by default, assumes that a function it doesn’t know can access any global variable. So it must emit the store to shared_event_count before the call to pthread_mutex_lock()—as far as the compiler knows, pthread_mutex_lock() reads the value of shared_event_count.
Inside pthread_mutex_lock is a memory fence for the CPU, if necessary.
Justification
From n1548:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Yes, there’s LTO. LTO can do some very surprising things. However, the fact is that writing to shared_event_count does have side effects and those side effects do affect the behavior of pthread_mutex_lock() and pthread_mutex_unlock().
The POSIX spec states that pthread_mutex_lock() provides synchronization. I could not find an explanation in the POSIX spec of what synchronization is, so this may have to suffice.
POSIX 4.12
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:
Yes, in theory the store to shared_event_count could be moved or eliminated—but the compiler would have to somehow prove that this transformation is legal. There are various ways you could imagine this happening. For example, the compiler might be configured to do “whole program optimization”, and it may observe that shared_event_count is never read by your program—at which point, it’s a dead store and can be eliminated by the compiler.
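For context, the writer side the answer is talking about would be something like this sketch (the mutex/condvar pointers and the function name are assumptions, matching the shape of the question's code):

#include <pthread.h>

extern pthread_mutex_t *mutex;
extern pthread_cond_t *cond;
extern long shared_event_count;

void publish_event(void){
    pthread_mutex_lock(mutex);
    shared_event_count++;          /* the opaque pthread calls could read this, */
    pthread_cond_broadcast(cond);  /* so the compiler must actually emit the store */
    pthread_mutex_unlock(mutex);
}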
Convention
This is how pthread_mutex_lock() has been used since the dawn of time. If compilers did this optimization, pretty much everyone’s code would break.
Volatile
I would generally not use this macro in ordinary code:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
This is a weird thing to do, useful in weird situations. Ordinary multithreaded code is not a sufficiently weird situation to use tricks like this. Generally, you want to either use locks or atomics to read or write shared values in a multithreaded program. ACCESS_ONCE does not use a lock and it does not use atomics—so, what purpose would you use it for? Why wouldn’t you use atomic_store() or atomic_load()?
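That is, for a shared value the plain C11 alternative would be something like this sketch (variable and function names are mine):

#include <stdatomic.h>

static _Atomic long event_count;

void set_count(long v){
    atomic_store_explicit(&event_count, v, memory_order_release);
}

long get_count(void){
    return atomic_load_explicit(&event_count, memory_order_acquire);
}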
In other words, it is unclear what you would be trying to do with volatile. The volatile keyword is easily the most abused keyword in C. It is rarely useful except when writing to memory-mapped IO registers.
Conclusion
The code is fine. Don’t use volatile.

Using GCC __sync extensions for a portable C library

I am developing a C library on OS X (10.10.x which happens to ship with GCC 4.2.x). This library is intended to be maximally portable and not specific to OS X.
I would like the end users to have the fewest headaches when building from source. So while the project is coded to std=c11 to get some of the benefits of the most modern C, it seems optional features such as atomics are not supported by this version of GCC.
I am assuming GNU-Linux and various BSD end users to have either (a) a later version of GCC, or (b) the chops to install the latest and greatest.
Is it a good decision to rely on the __sync extensions of GCC for the required CAS (etc.) semantics?
I think you need to take a step back and first define all your use cases. The merits of __sync vs C11 atomics aside, better to define your needs first (i.e. __sync/atomics are solutions not needs).
The Linux kernel is one of the heaviest, most sophisticated users of locking, atomics, etc. and C11 atomics aren't powerful enough for it. See https://lwn.net/Articles/586838/
For example, you might be far better off wrapping things in pthread_mutex_lock / pthread_mutex_unlock pairs. Declaring a struct as C11 atomic does not guarantee atomic access to the whole struct, only parts of it. So, if you needed the following to be atomic:
glob.x = 5;
glob.y = 7;
glob.z = 9;
You would be better wrapping this in the pthread_mutex_* pairing. For comparison, inside the Linux kernel, this would be spin locks or RCU. In fact, you might use RCU as well. Note that doing:
CAS(glob.x,5)
CAS(glob.y,7)
CAS(glob.z,9)
is not the same as the mutex pairing if you want an all or nothing update.
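A sketch of the mutex version of that all-or-nothing update (glob and its field values are carried over from above; the struct layout and lock name are mine):

#include <pthread.h>

struct globals { int x, y, z; };

static struct globals glob;
static pthread_mutex_t glob_lock = PTHREAD_MUTEX_INITIALIZER;

void update_glob(void){
    pthread_mutex_lock(&glob_lock);
    /* readers taking the same lock see either the old or the new triple,
       never a mix -- which three separate CASes cannot guarantee */
    glob.x = 5;
    glob.y = 7;
    glob.z = 9;
    pthread_mutex_unlock(&glob_lock);
}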
I'd wrap your implementation in some thin layer. For example, the best way might be __sync on one arch [say BSD] and atomics on another. By abstracting this into a .h file with macros/inlines, you can write "common code" without lots of #ifdef's everywhere.
I wrote a ring queue struct/object. Its updater could use CAS [I wrote my own inline asm for this], pthread_mutex_*, kernel spin locks, etc. Actual choice of which was controlled by one or two #ifdef's inside my_ring_queue.h
Another advantage to abstraction: You can change your mind farther down the road. Suppose you did an early pick of __sync or atomics. You code this up in 200 places in 30 files. Then comes the "big oops" where you realize this was the wrong choice. Lots of editing ensues. So, never put a naked [say] __sync_val_compare_and_swap in any of your .c files. Put it in once in my_atomics.h as something like #define MY_CAS_VAL(...) __sync_val_compare_and_swap(...) and use MY_CAS_VAL, as sketched below.
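A sketch of what such a my_atomics.h could look like (the header name, macro name, and USE_C11_ATOMICS switch are all illustrative):

/* my_atomics.h -- the only place that names a concrete CAS primitive */
#ifndef MY_ATOMICS_H
#define MY_ATOMICS_H

#ifdef USE_C11_ATOMICS
#include <stdatomic.h>
/* note: in this branch the shared object must be declared _Atomic int */
static inline int my_cas_val_int(_Atomic int *ptr, int expected, int desired){
    atomic_compare_exchange_strong(ptr, &expected, desired);
    return expected;   /* the value *ptr held before the attempt */
}
#define MY_CAS_VAL(ptr, oldv, newv) my_cas_val_int((ptr), (oldv), (newv))
#else
/* GCC legacy builtin: also returns the value *ptr held before the attempt */
#define MY_CAS_VAL(ptr, oldv, newv) __sync_val_compare_and_swap((ptr), (oldv), (newv))
#endif

#endif /* MY_ATOMICS_H */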
You might also be able to reduce the number of places that need interthread locking by using thread local storage for certain things like subpool allocs/frees.
You may also want to use a mixture of CAS and lock pairings. Some specific uses fare better with low-level CAS, and others would be more efficient with mutex pairs. Again, it helps if you can define your needs first.
Also, consider the final disaster scenario: The compiler doesn't support atomics and __sync is not available [or does not work] for the arch you're compiling to. What then?
In that case, note that all __sync operations can be implemented using pthread_mutex pairings. That's your disaster fallback.
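A sketch of that fallback, where one global mutex stands in for the missing hardware CAS (correct, though obviously not lock-free):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* emulates __sync_bool_compare_and_swap for int when no builtin is available */
bool emulated_cas_int(int *ptr, int oldval, int newval){
    bool swapped = false;
    pthread_mutex_lock(&fallback_lock);
    if(*ptr == oldval){
        *ptr = newval;
        swapped = true;
    }
    pthread_mutex_unlock(&fallback_lock);
    return swapped;
}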

How to create atomic section in c [duplicate]

Are there functions for performing atomic operations (like increment/decrement of an integer) supported by the C runtime library or any other utility libraries?
If yes, which operations can be made atomic using such functions?
Will it be more beneficial to use such functions than normal synchronization primitives like mutexes?
OS : Windows, Linux, Solaris & VxWorks
Prior to C11
The C library doesn't have any.
On Linux, gcc provides some -- look for __sync_fetch_and_add, __sync_fetch_and_sub, and so on.
In the case of Windows, look for InterlockedIncrement, InterlockedDecrement, InterlockedExchange, and so on. If you use gcc on Windows, I'd guess it also has the same built-ins as it does on Linux (though I haven't verified that).
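For illustration, the Windows flavour looks roughly like this (a sketch, not tied to a particular SDK version):

#include <windows.h>

static volatile LONG counter;

void bump(void){
    InterlockedIncrement(&counter);   /* atomic increment on Windows */
}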
On Solaris, it'll depend. Presumably if you use gcc, it'll probably (again) have the same built-ins it does under Linux. Otherwise, there are libraries floating around, but nothing really standardized.
C11
C11 added a (reasonably) complete set of atomic operations and atomic types. The operations include things like atomic_fetch_add and atomic_fetch_sub (and *_explicit versions of the same that let you specify the ordering you need, where the default ones always use memory_order_seq_cst). There are also fence functions, such as atomic_thread_fence and atomic_signal_fence.
The types correspond to each of the normal integer types--for example, atomic_int_least8_t corresponding to int_least8_t and atomic_uint_least64_t corresponding to uint_least64_t. Those are typedef names defined in <stdatomic.h>. To avoid conflicts with any existing names, you can omit the header; the compiler itself uses names in the implementor's namespace (e.g., _Atomic_uint_least32_t instead of atomic_uint_least32_t).
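For instance, a small sketch of the default vs. _explicit forms and a standalone fence (the variable names are mine):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int counter;
static atomic_bool done;

void increment_default(void){
    atomic_fetch_add(&counter, 1);                    /* memory_order_seq_cst */
}

void increment_relaxed(void){
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

void publish_done(void){
    atomic_thread_fence(memory_order_release);        /* standalone fence */
    atomic_store_explicit(&done, true, memory_order_relaxed);
}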
'Beneficial' is situational. Performance always depends on circumstances. You may expect something wonderful to happen when you switch out a mutex for something like this, but you may get no benefit (if it's not that common a case) or make things worse (if you accidentally create a 'spin-lock').
Across all supported platforms, you can use GLib's atomic operations. On platforms which have atomic operations built-in (e.g. assembly instructions), GLib will use them. On other platforms, it will fall back to using mutexes.
I think that atomic operations can give you a speed boost, even if mutexes are implemented using them. With the mutex, you will have at least two atomic ops (lock & unlock), plus the actual operation. If the atomic op is available, it's a single operation.
Not sure what you mean by the C runtime library. Neither the language proper nor the standard library provides you with any means to do this. You'd need to use an OS-specific library/API. Also, don't be fooled by sig_atomic_t -- it is not what it seems at first glance and is useful only in the context of signal handlers.
On Windows, there are InterlockedExchange and the like. For Linux, you can take glibc's atomic macros - they're portable (see i486 atomic.h). I don't know a solution for the other operating systems.
In general, you can use the xchg instruction on x86 for atomic operations (works on Dual Core CPUs, too).
As to your second question, no, I don't think that using atomic operations will be faster than using mutexes. For instance, the pthreads library already implements mutexes with atomic operations, which is very fast.

About atomicity guarantee in C

On x86 machines, instructions like inc and addl are not atomic, and in an SMP environment it is not safe to use them without the lock prefix. But in a UP (uniprocessor) environment it is safe, since inc, addl, and other single instructions won't be interrupted in the middle.
My problem is that, given a C-level statement like
x = x + 1;
Are there any guarantees that the C compiler will always use UP-safe instructions like
incl %eax
and not thread-unsafe sequences (like implementing the C statement in several instructions which may be interrupted by a context switch), even in a UP environment?
No.
You can use "volatile", which prevents the compiler from holding x in a temporary register, and for most targets this will actually have the intended effect. But it isn't guaranteed.
To be on the safe side you should either use some inline asm, or if you need to stay portable, encapsulate the increment with mutexes.
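A portable sketch of the mutex route (names are illustrative):

#include <pthread.h>

static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
static int x;

void increment_x(void){
    pthread_mutex_lock(&x_lock);
    x = x + 1;                      /* now safe on UP and SMP alike */
    pthread_mutex_unlock(&x_lock);
}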
There is absolutely no guarantee that "x = x + 1" will compile to interrupt-safe instructions on any platform, including x86. It may well be that it is safe for a specific compiler and specific processor architecture, but it's not mandated in the standards at all, and the standard is the only guarantee you get.
You can't consider anything to be safe based on what you think it will compile down to. Even if a specific compiler/architecture states that it is, relying on it is very bad since it reduces portability. Other compilers, architectures or even later versions on the same compiler and architecture can break your code quite easily.
It's quite feasible that x = x + 1 could compile to an arbitrary sequence such as:
load r0,[x] ; load memory into reg 0
incr r0 ; increment reg 0
stor [x],r0 ; store reg 0 back to memory
on a CPU that has no memory-increment instructions. Or it may be smart and compile it into:
lock ; disable task switching (interrupts)
load r0,[x] ; load memory into reg 0
incr r0 ; increment reg 0
stor [x],r0 ; store reg 0 back to memory
unlock ; enable task switching (interrupts)
where lock disables and unlock enables interrupts. But, even then, this may not be thread-safe in an architecture that has more than one of these CPUs sharing memory (the lock may only disable interrupts for one CPU), as you've already stated.
The language itself (or libraries for it, if it's not built into the language) will provide thread-safe constructs and you should use those rather than depend on your understanding (or possibly misunderstanding) of what machine code will be generated.
Things like Java synchronized and pthread_mutex_lock() (available to C under some OS') are what you want to look into.
If you use GLib, they have macros for int and pointer atomic operations.
http://library.gnome.org/devel/glib/stable/glib-Atomic-Operations.html
In recent versions of GCC there are __sync_xxx intrinsics to do exactly what you want.
Instead of writing:
x += 1;
write this:
__sync_fetch_and_add(&x, 1);
And gcc will make sure this will be compiled into an atomic opcode. This is supported on most important archs now.
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
It stems originally from recommendations from Intel for C on ia64, but has since found its way into gcc on a lot of other archs, too. So it's even a bit portable.
Are there any guarantees that the C compiler will always use UP-safe instructions
Not in the C standard. But your compiler/standard library may provide you with special types or certain guarantees.
This gcc doc may be along the lines of what you need.
I believe you need to resort to SMP targeted libraries or else roll your own inline assembler code.
A C compiler may implement a statement like x = x + 1 in several instructions.
You can use the register keyword to hint the compiler to use a register instead of memory, but the compiler is free to ignore it.
I suggest using OS-specific locking routines like the InterlockedIncrement function on Windows.
Worrying about just x86 is horribly unportable coding anyway. This is one of those seemingly small coding tasks that turns out to be a project in itself. Find an existing library project that solves this sort of problem for a wide range of platforms, and use it. GLib seems to be one, from what kaizer.se says.
