Using GCC __sync extensions for a portable C library - c

I am developing a C library on OS X (10.10.x which happens to ship with GCC 4.2.x). This library is intended to be maximally portable and not specific to OS X.
I would like end users to have as few headaches as possible when building from source. So while the project is coded to std=c11 to get some of the benefits of the most modern C, it seems optional features such as atomics are not supported by this version of GCC.
I am assuming GNU-Linux and various BSD end users to have either (a) a later version of GCC, or (b) the chops to install the latest and greatest.
Is it a good decision to rely on the __sync extensions of GCC for the required CAS (etc.) semantics?

I think you need to take a step back and first define all your use cases. The merits of __sync vs C11 atomics aside, better to define your needs first (i.e. __sync/atomics are solutions not needs).
The Linux kernel is one of the heaviest, most sophisticated users of locking, atomics, etc. and C11 atomics aren't powerful enough for it. See https://lwn.net/Articles/586838/
For example, you might be far better off wrapping things in pthread_mutex_lock / pthread_mutex_unlock pairs. Declaring a struct as C11 atomic does not guarantee atomic access to the whole struct, only parts of it. So, if you needed the following to be atomic:
glob.x = 5;
glob.y = 7;
glob.z = 9;
You would be better off wrapping this in the pthread_mutex_* pairing. For comparison, inside the Linux kernel, this would be spin locks or RCU. In fact, you might use RCU as well. Note that doing:
CAS(glob.x,5)
CAS(glob.y,7)
CAS(glob.z,9)
is not the same as the mutex pairing if you want an all or nothing update.
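To illustrate the all-or-nothing update, here is a minimal sketch using a POSIX mutex. The struct layout and the names glob_lock, glob_set and glob_get are assumptions made for illustration; they are not from the original post.

```c
#include <pthread.h>

/* Hypothetical shared state; the field names follow the example above. */
struct glob_state { int x, y, z; };

static struct glob_state glob;
static pthread_mutex_t glob_lock = PTHREAD_MUTEX_INITIALIZER;

/* Update all three fields as one unit: a reader that also takes the
 * lock can never observe a mix of old and new values. */
static void glob_set(int x, int y, int z)
{
    pthread_mutex_lock(&glob_lock);
    glob.x = x;
    glob.y = y;
    glob.z = z;
    pthread_mutex_unlock(&glob_lock);
}

/* Take a consistent snapshot under the same lock. */
static struct glob_state glob_get(void)
{
    pthread_mutex_lock(&glob_lock);
    struct glob_state copy = glob;
    pthread_mutex_unlock(&glob_lock);
    return copy;
}
```

The guarantee only holds if every reader and writer goes through the same lock; a thread that touches glob directly bypasses it.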
I'd wrap your implementation in some thin layer. For example, the best way might be __sync on one arch [say BSD] and atomics on another. By abstracting this into a .h file with macros/inlines, you can write "common code" without lots of #ifdef's everywhere.
I wrote a ring queue struct/object. Its updater could use CAS [I wrote my own inline asm for this], pthread_mutex_*, kernel spin locks, etc. Actual choice of which was controlled by one or two #ifdef's inside my_ring_queue.h
Another advantage to abstraction: You can change your mind farther down the road. Suppose you did an early pick of __sync or atomics. You code this up in 200 places in 30 files. Then, comes the "big oops" where you realize this was the wrong choice. Lots of editing ensues. So, never put a naked [say] __sync_val_compare_and_swap in any of your .c files. Put it in once in my_atomics.h as something like #define MY_CAS_VAL(...) __sync_val_compare_and_swap(...) and use MY_CAS_VAL
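A minimal sketch of what such a my_atomics.h could look like. Only MY_CAS_VAL is named above; MY_FETCH_ADD, the include guard and the version check are assumptions added for illustration, and the header covers only the __sync backend.

```c
/* my_atomics.h: hypothetical wrapper header. All atomic operations used
 * by the library go through these macros, never through raw builtins. */
#ifndef MY_ATOMICS_H
#define MY_ATOMICS_H

#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
/* GCC 4.1 and later (and compilers that identify as such, e.g. clang)
 * provide the __sync builtins. */

/* Returns the value *ptr held before the operation. */
#define MY_CAS_VAL(ptr, expected, desired) \
    __sync_val_compare_and_swap((ptr), (expected), (desired))

/* Returns the value *ptr held before the addition. */
#define MY_FETCH_ADD(ptr, n) __sync_fetch_and_add((ptr), (n))

#else
#error "my_atomics.h: no atomic backend for this compiler; add one here"
#endif

#endif /* MY_ATOMICS_H */
```

Switching to C11 atomics or to a mutex fallback later then means adding another #if branch in this one file, not editing every caller.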
You might also be able to reduce the number of places that need interthread locking by using thread local storage for certain things like subpool allocs/frees.
You may also want to use a mixture of CAS and lock pairings. Some specific uses fare better with low-level CAS and others would be more efficient with mutex pairs. Again, it helps if you can define your needs first.
Also, consider the final disaster scenario: The compiler doesn't support atomics and __sync is not available [or does not work] for the arch you're compiling to. What then?
In that case, note that all __sync operations can be implemented using pthread_mutex pairings. That's your disaster fallback.
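As a rough sketch of that fallback, a compare-and-swap for long can be emulated with one global mutex. The names my_atomic_lock and my_cas_val_fallback are hypothetical, and a real library would need one such function per operand type.

```c
#include <pthread.h>

/* One global lock serializes every emulated atomic operation. Simple,
 * but slow under contention; meant only as a last resort. */
static pthread_mutex_t my_atomic_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mirrors the contract of __sync_val_compare_and_swap for long:
 * if *ptr equals expected, store desired; in either case return the
 * value *ptr held beforehand. */
static long my_cas_val_fallback(long *ptr, long expected, long desired)
{
    pthread_mutex_lock(&my_atomic_lock);
    long old = *ptr;
    if (old == expected)
        *ptr = desired;
    pthread_mutex_unlock(&my_atomic_lock);
    return old;
}
```

This is atomic only with respect to other accesses that take the same lock, so every access to the variable has to go through the wrapper layer.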

Related

Is it possible to generate ansi C functions with type information for a moving GC implementation?

I am wondering what methods there are to add typing information to generated C methods. I'm transpiling a higher-level programming language to C and I'd like to add a moving garbage collector. However to do that I need the method variables to have typing information, otherwise I could modify a primitive value that looks like a pointer.
An obvious approach would be to encapsulate all (primitive and non-primitive) variables in a struct that has an extra (enum) variable for typing information. However, this would cause memory and performance overhead, and the transpiled code is meant for embedded platforms. If I were to accept the memory overhead, the obvious option would be to use a heap handle for all objects, and then I'd be able to freely move heap blocks. However, I'm wondering if there's a more efficient approach.
I've come up with a potential solution, namely to predeclare and group variables based on whether they're primitives or not (I can do that in the transpiler), and add an offset variable to each method at the end (I need to be able to find it accurately when scanning the stack area) that tells me where the non-primitive variables begin and where they end, so I can scan only those. This means that each method will use an additional 16/32 bits (depending on arch) of memory; however, this should still be more memory efficient than the heap handle approach.
Example:
void my_func() {
int i = 5;
int z = 3;
bool b = false;
void* person;
void* person_info = ...;
.... // logic
volatile int offset = 0x034;
}
My aim is for something that works universally across GCC compilers, thus my concerns are:
Can the compiler reorder the variables from how they're declared in the source code?
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Can I find the offset accurately when scanning the stack?
I'd like to avoid assembly so this approach can work (by default) across multiple platforms, however I'm open for methods even if they involve assembly (if they're reliable).
Typing information could be somehow encoded in the C function name; this is done by C++ and other implementations and called name mangling.
Actually, you could decide, since all your C code is generated, to adopt a different convention: generate long C identifiers which are practically unique and sort-of random program-wide, such as tiziw_7oa7eIzzcxv03TmmZ and keep their typing information elsewhere (e.g. some database). On Linux, such an approach is friendly to both libbacktrace and dlsym(3) + dladdr(3) (and of course nm(1) or readelf(1) or gdb(1)), so used in both bismon and RefPerSys projects.
Typing information is practically tied to calling conventions and ABIs. For example, the x86-64 ABI for Linux mandates different processor registers for passing floating points or pointers.
Read the Garbage Collection handbook or at least P.Wilson Uniprocessor Garbage Collection Techniques survey. You could decide to use tagged integers instead of boxing them, and you could decide to have a conservative GC (e.g. Boehm's GC) instead of a precise one. In my old GCC MELT project I generated C or C++ code for a generational copying GC. Similar techniques are used both in Bismon and in RefPerSys.
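To make the tagged-integer idea concrete, here is a minimal sketch in which the low bit of a machine word distinguishes an immediate integer from a pointer. The names value_t, value_from_int and so on are hypothetical, and the scheme assumes heap objects are at least 2-byte aligned.

```c
#include <stdint.h>

typedef uintptr_t value_t;   /* hypothetical: one machine word per value */

/* Low bit set: immediate integer. Low bit clear: pointer to a heap
 * object, which assumes every object is at least 2-byte aligned. */
static inline int value_is_int(value_t v)
{
    return (int)(v & 1u);
}

/* The top bit of i is shifted out, so only half of the intptr_t
 * range can be represented. */
static inline value_t value_from_int(intptr_t i)
{
    return ((value_t)i << 1) | 1u;
}

/* Assumes the usual two's complement conversion from uintptr_t to
 * intptr_t (implementation-defined in ISO C, documented by GCC). */
static inline intptr_t value_to_int(value_t v)
{
    return (intptr_t)(v - 1u) / 2;
}

static inline value_t value_from_ptr(void *p) { return (value_t)p; }
static inline void *value_to_ptr(value_t v)   { return (void *)v; }
```

With such a representation a precise collector can tell from the word itself whether it may relocate the referent, at the cost of one bit of integer range.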
Since you are transpiling to C, consider also alternatives, such as libgccjit or LLVM. Look into libjit and asmjit.
Study also the implementation of other transpilers (compilers to C), including Chicken/Scheme and Bigloo.
Can the GCC compiler reorder the variables from how they're declared in the source code?
Of course yes, depending upon the optimizations you ask for. Some variables won't even exist in the binary (e.g. those staying in registers).
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Better generate a single struct variable containing all your language variables, and leave optimizations to the compiler. You will be surprised (see this draft report).
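One possible reading of the single-struct advice is a "shadow stack": each generated function keeps its variables in one struct and links that struct into a chain the collector can walk. The chaining, and every name below (gc_frame, gc_top, gc_count_live_roots), are assumptions of this sketch, not something the answer prescribes.

```c
#include <stddef.h>

/* Header that links each generated frame into a chain of GC roots. */
struct gc_frame {
    struct gc_frame *prev;   /* frame of the caller */
    void **slots;            /* pointer variables of this frame */
    size_t nslots;
};

static struct gc_frame *gc_top = NULL;

/* Stand-in for the collector's root scan: counts non-NULL roots. */
static size_t gc_count_live_roots(void)
{
    size_t n = 0;
    for (struct gc_frame *f = gc_top; f != NULL; f = f->prev)
        for (size_t k = 0; k < f->nslots; k++)
            if (f->slots[k] != NULL)
                n++;
    return n;
}

/* What the transpiler might emit for my_func from the question. */
static size_t my_func(void)
{
    struct {
        struct gc_frame hdr;
        void *ptr[2];        /* ptr[0] = person, ptr[1] = person_info */
        int i, z;
        _Bool b;
    } fr = { { gc_top, NULL, 2 }, { NULL, NULL }, 5, 3, 0 };
    fr.hdr.slots = fr.ptr;
    gc_top = &fr.hdr;                      /* push the frame */

    fr.ptr[1] = &fr.i;                     /* stand-in for a heap object */
    size_t live = gc_count_live_roots();   /* a collection could run here */

    gc_top = fr.hdr.prev;                  /* pop before returning */
    return live;
}
```

Because the collector finds pointers through the chain and not by guessing at the stack layout, the compiler stays free to reorder or optimize everything else.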
Can I find the offset accurately when scanning the stack?
This is the most difficult, and depends a lot on compiler optimizations (e.g. whether you run gcc with -O1 or -O3 on the generated C code; in some cases a recent GCC, e.g. GCC 9 or GCC 10 on x86-64 for Linux, is capable of tail-call optimizations; check by compiling with gcc -O3 -S -fverbose-asm and then looking into the produced assembler code). If you accept some small tricks specific to the target processor and compiler, this is doable. Study the implementation of the OCaml compiler.
Send me (to basile#starynkevitch.net) an email for discussion. Please mention the URL of your question in it.
If you want to have an efficient generational copying GC with multi-threading, things become extremely tricky. The question is then how many years of development can you afford spending.
If you have exceptions in your language, take great care as well. You could, with great caution, generate calls to longjmp.
See of course this answer of mine.
With transpiling techniques, the evil is in the details
On Linux (specifically!) see also my manydl.c program. It demonstrates that on a Linux x86-64 laptop you could generate, in practice, hundreds of thousands of dlopen(3)-ed plugins. Then read How to write shared libraries.
Study also the implementation of SBCL and of GNU Prolog, at least for inspiration.
PS. The dream of a totally architecture-neutral and operating-system independent transpiler is an illusion.

What remains in C if I exclude libraries and compiler extensions?

Imagine a situation where you can't or don't want to use any of the libraries provided by the compiler as "standard", nor any external library. You can't use even the compiler extensions (such as gcc extensions).
What is the remaining part you get if you strip C language of all the things a lot of people use as a matter of course?
In that sense, a list of every callable function supported out of the box by any big C compiler (not only ANSI C) would probably be satisfying as an answer, as it'd at least approximately show the use case of the language.
First I thought about sizeof() and printf() (those were already clarified in the comments - operator + stdio), so... what remains? In-line assembly seems like an extension too, so that pretty much strips even the option to use assembly with C, if I'm right.
Probably in the matter of code it'd be easier to understand. Imagine a code compiled with only e.g. gcc main.c (output flag permitted) that has no #include, nor extern.
int main() {
// replace_me
return 0;
}
What can I call to actually do something else than "boring" type math and casting from type to type?
Note that switch, goto, if, loops and other constructs that do nothing and only allow repeating a piece of code aren't the thing I'm looking for (if it isn't obvious).
(Hopefully the edit clarified wtf I'm actually asking, but Matteo's answer pretty much did it.)
If you remove all libraries essentially you have something similar to a freestanding implementation of C (which still has to provide some libraries - say, string.h, but that's nothing you couldn't easily implement yourself in portable C), and that's what normally you start with when programming microcontrollers and other computers that don't have a ready-made operating system - and what operating system writers in general use when they compile their operating systems.
There you typically have two ways of doing stuff besides "raw" computation:
assembly blocks (where you can do literally anything the underlying machine can do);
memory mapped IO (you set a volatile pointer to some hardware dependent location and read/write from it; that affects hardware stuff).
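A minimal sketch of the memory mapped IO technique, using nothing beyond the language and the freestanding header <stdint.h>. The helper names are hypothetical, and the UART address in the comment is made up; real addresses come from the hardware data sheet.

```c
#include <stdint.h>

/* Write one byte to a device register. `volatile` tells the compiler
 * that every access is a side effect, so it may not be merged with
 * other accesses or optimized away. */
static inline void mmio_write8(volatile uint8_t *reg, uint8_t value)
{
    *reg = value;
}

/* Read one byte from a device register. */
static inline uint8_t mmio_read8(const volatile uint8_t *reg)
{
    return *reg;
}

/* On real hardware the pointer is a fixed address, for example:
 *
 *   #define UART0_DATA ((volatile uint8_t *)0x40001000u)  // made-up address
 *   mmio_write8(UART0_DATA, 'A');
 *
 * On a hosted system such an access would normally fault, because the
 * OS does not map device memory into ordinary processes. */
```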
That's really all you need to build anything - and after all, it all boils down to that stuff anyway: the C library of a regular hosted implementation is normally written in C itself, with some assembly used either for speed or to communicate with the operating system (typically the syscalls are invoked through some kind of interrupt).
Again, it's nothing you couldn't implement yourself. But the point of having a standard library is both to avoid continually reinventing the wheel and to have a set of portable functions that spare you from rewriting everything with knowledge of the details of each target platform.
And mainstream operating systems, in turn, are generally written in a mix of C and assembly as well.
C has no "built-in" functions as such. A compiler implementation may include "intrinsic" functions that are implemented directly by the compiler without provision of an external library, although a prototype declaration is still required for intrinsics, so you would still normally include a header file for such declarations.
C is a systems-level language with a minimal run-time and start-up requirement. Because it can directly access memory and memory mapped I/O there is very little that it cannot do (and what it cannot do is what you use assembly, in-line assembly or intrinsics for). For example, much of the library code you are wondering what you can do without is written in C. When running in an OS environment however (using C as an application-level rather than system-level language), you cannot practically use C in that manner - the OS has control over such things as I/O and memory-management and in modern systems will normally prevent unmediated access to such resources. Of course that OS itself is likely to be largely written in C (and/or C++).
In a standalone or bare-metal environment with no OS, C is often used very early in the bootstrap process, initialising hardware and establishing an application execution environment. In fact on ARM Cortex-M processors it is possible to boot directly into C code from reset, since the hardware loads an initial stack-pointer and start address from the vector table on start-up; this being enough to run C code that does not rely on library or static data initialisation - such initialisation can however be written in C before calling main().
Note that sizeof is not a function, it is an operator.
I don't think you really understand the situation.
You don't need a header to call a function in C. You can call with unchecked parameters - a bad idea and an obsolete feature, but still supported. And if a compiler links a library by default instead of only when you explicitly tell it to, that's only a little switch within the compiler to "link libc". Notoriously Unix compilers need to be told to link the math library, it wasn't linked by default because some very early programs didn't use floating point.
To be fair, some standard library functions like memcpy tend to be special-cased these days as they lend themselves to inlining and optimisation.
The standard library is documented and is usually available, though in effect deprecated by Microsoft for security reasons. You can write pretty much any function quite easily with only stdlib functions, what you can't do is fancy IO.

What is the difference between the __sync and __atomic intrinsics of gcc

I'm writing a toy operating system (so I cannot use any library, including the standard one), compiled with gcc, and I want to use atomics for some of the synchronization code. After some search, I found that gcc has two sets of builtins for atomic operations, __sync_* and __atomic_*, but there is no information as to the difference between the two.
What is the difference between these two besides the latter has a parameter for memory ordering? Is the __sync_ version equivalent to __atomic_ version with the sequential ordering? Is the __sync_ version deprecated in favor of the __atomic_ one?
Disclaimer: I have not used these primitives before. The following answer is based on my reading of the documentation and previous experience with concurrency.
Is the __sync_ version deprecated in favor of the __atomic_ one?
Yes, you should use __atomic and let the compiler fall back to __sync when necessary.
Is the __sync_ version equivalent to __atomic_ version with the sequential ordering?
No, the exact ordering guarantees are specified in the documentation for __sync. If you use __atomic, and the compiler chooses to fall back to __sync, then it will add code to meet the requested ordering guarantees.
From the documentation for __atomic:
Target architectures are encouraged to provide their own patterns for each of these built-in functions. If no target is provided, the original non-memory model set of ‘__sync’ atomic built-in functions are utilized, along with any required synchronization fences surrounding it in order to achieve the proper behavior. Execution in this case is subject to the same restrictions as those built-in functions.
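To make the API difference concrete, here is a small sketch that performs the same increment with each family, plus a compare-and-swap in the __atomic style. The function names are hypothetical, and the __atomic builtins require GCC 4.7 or later (or a compatible compiler).

```c
/* Legacy builtin: no ordering parameter. Returns the old value. */
static int bump_sync(int *counter)
{
    return __sync_fetch_and_add(counter, 1);
}

/* Newer builtin: the memory order is an explicit argument.
 * Returns the old value. */
static int bump_atomic(int *counter)
{
    return __atomic_fetch_add(counter, 1, __ATOMIC_SEQ_CST);
}

/* Strong compare-and-swap. Returns nonzero on success; on failure
 * *expected is overwritten with the value actually found in *ptr. */
static int cas_atomic(int *ptr, int *expected, int desired)
{
    return __atomic_compare_exchange_n(ptr, expected, desired,
                                       0 /* strong, not weak */,
                                       __ATOMIC_ACQ_REL,  /* on success */
                                       __ATOMIC_ACQUIRE); /* on failure */
}
```

The sketch only contrasts the call shapes; it makes no claim about which ordering the __sync call ends up with on a given target.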
A final word of caution: not all the __sync or __atomic operations can be implemented inline. The compiler may implement them as a call to an external function that is (presumably) implemented in the standard library. If you don't have access to the standard library, then you'll have to implement the missing functions yourself. Here is the relevant quote from the documentation:
If there is no pattern or mechanism to provide a lock free instruction sequence, a call is made to an external routine with the same parameters to be resolved at run time.
These primitives are a low-level mechanism, and you should understand what the compiler can and cannot do.
For an example of what code the compiler generates inline, see the related question: Atomic operations and code generation for gcc

How to create atomic section in c [duplicate]

Are there functions for performing atomic operations (like increment / decrement of an integer) etc supported by C Run time library or any other utility libraries?
If yes, what all operations can be made atomic using such functions?
Will it be more beneficial to use such functions than the normal synchronization primitives like mutex etc?
OS : Windows, Linux, Solaris & VxWorks
Prior to C11
The C library doesn't have any.
On Linux, gcc provides some -- look for __sync_fetch_and_add, __sync_fetch_and_sub, and so on.
In the case of Windows, look for InterlockedIncrement, InterlockedDecrement, InterlockedExchange, and so on. If you use gcc on Windows, I'd guess it also has the same built-ins as it does on Linux (though I haven't verified that).
On Solaris, it'll depend. Presumably if you use gcc, it'll probably (again) have the same built-ins it does under Linux. Otherwise, there are libraries floating around, but nothing really standardized.
C11
C11 added a (reasonably) complete set of atomic operations and atomic types. The operations include things like atomic_fetch_add and atomic_fetch_sub (and *_explicit versions of same that let you specify the ordering model you need, where the default ones always use memory_order_seq_cst). There are also fence functions, such as atomic_thread_fence and atomic_signal_fence.
The types correspond to each of the normal integer types--for example, atomic_int_least8_t corresponding to int_least8_t and atomic_uint_least64_t corresponding to uint_least64_t. Those are typedef names defined in <stdatomic.h>. To avoid conflicts with any existing names, you can omit the header and use the _Atomic keyword that the language itself provides (e.g., _Atomic uint_least32_t instead of atomic_uint_least32_t).
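A short sketch of the C11 interface, assuming a compiler and C library that ship <stdatomic.h>. The counter and the function names are hypothetical.

```c
#include <stdatomic.h>

/* Static storage, so the counter starts out as zero. */
static atomic_int hits;

/* Default form: always memory_order_seq_cst. Returns the old value. */
static int hit(void)
{
    return atomic_fetch_add(&hits, 1);
}

/* _explicit form: the caller picks the ordering. Returns the old value. */
static int unhit_relaxed(void)
{
    return atomic_fetch_sub_explicit(&hits, 1, memory_order_relaxed);
}

/* Read the current value with the default (seq_cst) ordering. */
static int hits_now(void)
{
    return atomic_load(&hits);
}
```

Relaxed ordering is enough for a plain statistics counter; anything that publishes data to another thread needs a stronger order.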
'Beneficial' is situational. As always, performance depends on circumstances. You may expect something wonderful to happen when you swap a mutex for something like this, but you may get no benefit (if it's not a frequently hit case) or make things worse (if you accidentally create a 'spin-lock').
Across all supported platforms, you can use GLib's atomic operations. On platforms which have atomic operations built-in (e.g. assembly instructions), glib will use them. On other platforms, it will fall back to using mutexes.
I think that atomic operations can give you a speed boost, even if mutexes are implemented using them. With the mutex, you will have at least two atomic ops (lock & unlock), plus the actual operation. If the atomic op is available, it's a single operation.
Not sure what you mean by the C runtime library. Neither the language proper nor the standard library provides you with any means to do this. You'd need to use an OS-specific library/API. Also, don't be fooled by sig_atomic_t -- it is not what it seems at first glance and is useful only in the context of signal handlers.
On Windows, there are InterlockedExchange and the like. For Linux, you can take glibc's atomic macros - they're portable (see i486 atomic.h). I don't know a solution for the other operating systems.
In general, you can use the xchg instruction on x86 for atomic operations (works on Dual Core CPUs, too).
As to your second question, no, I don't think that using atomic operations will be faster than using mutexes. For instance, the pthreads library already implements mutexes with atomic operations, which is very fast.

Is there a way to test whether thread safe functions are available in the C standard library?

In regards to the thread safe functions in newer versions of the C standard library, is there a cross-platform way to tell if these are available via pre-processor definition? I am referring to functions such as localtime_r().
If there is not a standard way, what is the reliable way in GCC? [EDIT] Or posix systems with unistd.h?
There is no standard way to test that, which means there is no way to test it across all platforms. Tools like autoconf will create a tiny C program that calls this function and then try to compile and link it. If this works, the function apparently exists; if not, it may not exist (or the compiler options are wrong and the appropriate CFLAGS need to be set).
So you have basically 6 options:
Require them to exist. Your code can only work on platforms where they exist; period. If they don't exist, compilation will fail, but that is not your problem, since the platform violates your minimum requirements.
Avoid using them. If you use the non-thread safe ones, maybe protected by a global lock (e.g. a mutex), it doesn't matter if they exist or not. Of course your code will then only work on platforms with POSIX mutexes, however, if a platform has no POSIX mutexes, it won't have POSIX threads either and if it has no POSIX threads (and I guess you are probably using POSIX threads w/o supporting any alternative), why would you have to worry about thread-safety in the first place?
Decide at runtime. Depending on the platform, either do a "weak link", so you can test at runtime if the function was found or not (a pointer to the function will point to NULL if it wasn't) or alternatively resolve the symbol dynamically using something like dlsym() (which is also not really portable, but widely supported in the Linux/UNIX world). However, in that case you need a fallback if the function is not found at runtime.
Use a tool like autoconf, some other tool with similar functionality, or your own configuration script to determine this prior to start of compilation (and maybe set preprocessor macros depending on result). In that case you will also need a fallback solution.
Limit usage to well known platforms. Whether this function is available on a certain platform is usually known (and once it is available, it won't go away in the future). Most platforms expose preprocessor macros to test what kind of platform that is and sometimes even which version. E.g. if you know that GNU/Linux, Android, Free/Open/NetBSD, Solaris, iOS and MacOS X all offer this function, test if you are compiling for one of these platforms and if yes, use it. If the code is compiled for another platform (or if you cannot determine what platform that is), it may or may not offer this function, but since you cannot say for sure, better be safe and use the fallback.
Let the user decide. Either always use the fallback unless the user has signaled support, or do it the other way round (which probably makes more sense): always assume it is there and, in case compilation fails, offer a way for the user to force your code into "compatibility mode" by somehow specifying that thread-safe functions are not available (e.g. by setting an environment variable or by using a different make target). Of course this is the least convenient method for the (poor) user.
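A sketch that combines the platform check with a user override, assuming a POSIX-style system where <unistd.h> defines _POSIX_THREAD_SAFE_FUNCTIONS when the _r functions are supported. MY_HAVE_LOCALTIME_R and my_localtime are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* must come before the first #include */
#endif
#include <time.h>
#if defined(__unix__) || defined(__APPLE__)
#include <unistd.h>               /* POSIX option macros live here */
#endif

/* The user can force either path: -DMY_HAVE_LOCALTIME_R=0 or =1. */
#ifndef MY_HAVE_LOCALTIME_R
#if defined(_POSIX_THREAD_SAFE_FUNCTIONS) && _POSIX_THREAD_SAFE_FUNCTIONS > 0
#define MY_HAVE_LOCALTIME_R 1
#else
#define MY_HAVE_LOCALTIME_R 0
#endif
#endif

/* Fills *out with the broken-down local time; returns out or NULL. */
static struct tm *my_localtime(const time_t *t, struct tm *out)
{
#if MY_HAVE_LOCALTIME_R
    return localtime_r(t, out);
#else
    /* Fallback: NOT thread-safe on its own. Callers must serialize
     * calls with whatever lock their threading layer provides. */
    struct tm *tmp = localtime(t);
    if (tmp == NULL)
        return NULL;
    *out = *tmp;
    return out;
#endif
}
```

The rest of the code base then calls only my_localtime, so the decision about which path is available stays in one place.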
